# Verbatim Dicta

Real-time audio transcription using Whisper AI with optional LLM-powered analysis. Captures system audio via loopback and transcribes it with configurable models and processing options.

## Features

- Real-time transcription of system audio (Windows/Linux)
- Multiple Whisper model sizes (tiny to large)
- Multi-language support
- **Sentence extraction mode** - Stitches audio chunks into complete sentences
- Optional LLM analysis for fact-checking and question generation (via Ollama)
- GPU acceleration support
- Flexible audio device configuration

## Quick Start

```bash
# Install dependencies
pip install -r requirements.txt

# Basic transcription (no LLM)
python transcribe_speakers.py

# With LLM analysis (optional)
python transcribe_speakers.py --enable-llm

# With sentence extraction
python transcribe_speakers.py --sentence-mode

# List audio devices
python transcribe_speakers.py --list-devices
```

## Requirements

- **OS**: Windows 10/11 or Linux
- **Python**: 3.8+
- **Audio**: Loopback device (Stereo Mix/VB-Cable on Windows, PulseAudio on Linux)
- **Optional**: CUDA-capable GPU, Ollama for LLM features

## Installation

### 1. Install Dependencies

```bash
pip install -r requirements.txt
```

### 2. GPU Support (Optional)

For CUDA 11.8:

```bash
pip install torch==2.8.0+cu118 --index-url https://download.pytorch.org/whl/cu118
```

For CUDA 12.1:

```bash
pip install torch==2.8.0+cu121 --index-url https://download.pytorch.org/whl/cu121
```

### 3. Audio Loopback Setup

**Windows - Option A (Stereo Mix):**

1. Right-click speaker icon → Sounds → Recording tab
2. Right-click → Show Disabled Devices
3. Enable and set Stereo Mix as default

**Windows - Option B (VB-Cable, recommended):**

1. Download from [vb-audio.com](https://vb-audio.com/Cable/)
2. Install and restart
3. Use `--device "CABLE Output"`

**Linux:**

Configure PulseAudio loopback or use `transcribe_dual_linux.py`

### 4. LLM Features (Optional)

```bash
# Install Ollama from ollama.ai
ollama pull llama3.2
```

## Usage

### Available Scripts

- `transcribe_speakers.py` - Main script with all features (LLM optional via `--enable-llm`)
- `transcribe_dual_linux.py` - Linux-specific with dual audio support

### Common Commands

```bash
# Quick start with GPU (English)
./RUN_GPU.sh

# Dutch language
./RUN_DUTCH.sh

# Dutch with LLM analysis
./RUN_DUTCH_LLM.sh

# With LLM analysis
./RUN_GPU.sh --enable-llm

# Save to file
./RUN_GPU.sh --output transcript.txt

# Other languages (Spanish, French, German, etc.)
./RUN_GPU.sh --language es  # Spanish
./RUN_GPU.sh --language fr  # French
./RUN_GPU.sh --language de  # German

# Maximum accuracy with LLM and sentence extraction
python transcribe_speakers.py --model large --enable-llm --sentence-mode --output enriched.txt

# Force CPU (if GPU issues)
python transcribe_speakers.py --force-cpu
```

### Key Options

| Option | Description | Default |
|--------|-------------|---------|
| `--model` | Model size: tiny/base/small/medium/large | base |
| `--language` | Language code (en/es/fr/de/ja/etc.) | en |
| `--device` | Audio device name (partial match) | Auto |
| `--interval` | Processing interval (seconds) | 8.0 |
| `--min-duration` | Minimum audio duration | 3.0 |
| `--fast-mode` | Fast mode (3-5x faster, lower accuracy) | False |
| `--enable-llm` | Enable fact-checking and questions | False |
| `--llm-model` | Ollama model to use | llama3.2 |
| `--output` | Save to file | None |
| `--force-cpu` | Disable GPU | False |
| `--gpu-index` | GPU device index | 0 |
| `--sentence-mode` | Extract complete sentences from chunks | False |

## Model Performance

| Model | Size | Speed | Quality | Best For |
|-------|------|-------|---------|----------|
| tiny | ~75 MB | Fastest | Basic | Quick tests, low-latency |
| base | ~145 MB | Fast | Good | General real-time use |
| small | ~485 MB | Moderate | Better | Balanced accuracy/speed |
| medium | ~1.5 GB | Slow | Great | High accuracy needs |
| large | ~3 GB | Slowest | Best | Maximum accuracy |

## Optimization Presets

**Low Latency (Real-Time):**

```bash
python transcribe_speakers.py --model tiny --fast-mode --interval 2 --min-duration 1.5
```

**Balanced:**

```bash
python transcribe_speakers.py --model base --interval 5
```

**High Accuracy:**

```bash
python transcribe_speakers.py --model large --interval 10 --enable-llm
```

## Troubleshooting

**No loopback device:**

- Windows: Enable Stereo Mix or install VB-Cable
- Linux: Configure PulseAudio loopback

**CUDA errors:**

```bash
python transcribe_speakers.py --force-cpu
```

**No audio captured:**

- Verify audio is playing
- Check device: `--list-devices`
- Increase system volume

**Poor quality:**

- Use larger model: `--model medium`
- Increase interval: `--interval 10`
- Specify language: `--language`

**Ollama errors:**

- Ensure Ollama is running
- Pull model: `ollama pull llama3.2`

## Output Format

**Standard:**

```
[14:23:15] Transcribed audio segment.
[14:23:23] Another segment with timestamp.
```

**With LLM (--enable-llm):**

```
======================================================================
[14:23:15] The Earth revolves around the Sun in 365 days.
📊 Fact Check: FACTUAL (confidence: 0.98)
💡 Scientifically accurate. Earth's orbital period is 365.25 days.
❓ Questions:
1. Why do we need leap years?
2. How does Earth's orbit affect seasons?
======================================================================
```

## Technical Stack

- **Audio**: sounddevice, soundfile (16kHz mono, 16-bit PCM)
- **Transcription**: faster-whisper (optimized Whisper)
- **LLM**: Ollama (local inference)
- **Capture**: WASAPI loopback (Windows), PulseAudio (Linux)

## Future Work

- Real-time streaming transcription with reduced buffering
- Speaker diarization improvements
- Web interface for remote monitoring
- Multi-device simultaneous transcription
- Cloud LLM integration options
- Custom vocabulary and domain adaptation
- Noise reduction preprocessing

## License

Uses [Whisper](https://github.com/openai/whisper) (OpenAI), [faster-whisper](https://github.com/SYSTRAN/faster-whisper) (SYSTRAN), and [Ollama](https://ollama.ai).
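
## Appendix: Parsing Saved Transcripts

The standard output format documented above (`[HH:MM:SS] text`) is simple enough to post-process with a few lines of Python. The sketch below is not part of the project: `parse_transcript` and `LINE_RE` are hypothetical names, and it assumes that a file written with `--output` uses the same line format as the console output. Lines that do not start with a timestamp (separators, LLM annotations, blank lines) are skipped.

```python
import re
from datetime import time
from typing import Iterable, Iterator, Tuple

# Matches lines such as "[14:23:15] Transcribed audio segment."
# Assumption: saved transcripts use the same format as the console output.
LINE_RE = re.compile(r"^\[(\d{2}):(\d{2}):(\d{2})\]\s+(.*)$")


def parse_transcript(lines: Iterable[str]) -> Iterator[Tuple[time, str]]:
    """Yield (timestamp, text) pairs for every timestamped line."""
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # separator, LLM annotation, or blank line
        hh, mm, ss, text = match.groups()
        try:
            stamp = time(int(hh), int(mm), int(ss))
        except ValueError:
            continue  # digits present but not a valid clock time
        yield stamp, text


if __name__ == "__main__":
    sample = [
        "[14:23:15] Transcribed audio segment.",
        "[14:23:23] Another segment with timestamp.",
    ]
    for stamp, text in parse_transcript(sample):
        print(stamp.isoformat(), text)
```

To read a real file, pass an open file handle instead of the sample list, for example `parse_transcript(open("transcript.txt", encoding="utf-8"))`.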