Verbatim Dicta
Real-time audio transcription using Whisper AI with optional LLM-powered analysis. Captures system audio via loopback and transcribes it with configurable models and processing options.
Features
- Real-time transcription of system audio (Windows/Linux)
- Multiple Whisper model sizes (tiny to large)
- Multi-language support
- Sentence extraction mode that stitches audio chunks into complete sentences
- Optional LLM analysis for fact-checking and question generation (via Ollama)
- GPU acceleration support
- Flexible audio device configuration
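The sentence extraction feature is described only as stitching chunks into complete sentences. As a rough illustration of that idea, and not the project's actual implementation, the sketch below buffers transcribed text and releases a sentence only once terminal punctuation has been seen. The class name `SentenceStitcher` and its methods are hypothetical.

```python
import re
from typing import List

# Whitespace that directly follows sentence-final punctuation.
_SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")


class SentenceStitcher:
    """Hypothetical helper: buffers chunk text, emits only complete sentences."""

    def __init__(self) -> None:
        self._buffer = ""

    def feed(self, chunk: str) -> List[str]:
        """Add one transcribed chunk; return sentences that are now complete."""
        self._buffer = f"{self._buffer} {chunk.strip()}".strip()
        parts = _SENTENCE_BREAK.split(self._buffer)
        if parts[-1].endswith((".", "!", "?")):
            # Everything in the buffer is terminated, so release all of it.
            complete, self._buffer = parts, ""
        else:
            # Keep the unfinished tail for the next chunk.
            complete, self._buffer = parts[:-1], parts[-1]
        return [s for s in complete if s]

    def flush(self) -> str:
        """Return any unfinished text (for example at shutdown) and reset."""
        rest, self._buffer = self._buffer, ""
        return rest
```

A punctuation-based split like this will break on abbreviations such as "Dr."; a real implementation would likely need something more robust.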
Quick Start
# Install dependencies
pip install -r requirements.txt
# Basic transcription (no LLM)
python transcribe_speakers.py
# With LLM analysis (optional)
python transcribe_speakers.py --enable-llm
# With sentence extraction
python transcribe_speakers.py --sentence-mode
# List audio devices
python transcribe_speakers.py --list-devices
Requirements
- OS: Windows 10/11 or Linux
- Python: 3.8+
- Audio: Loopback device (Stereo Mix/VB-Cable on Windows, PulseAudio on Linux)
- Optional: CUDA-capable GPU, Ollama for LLM features
Installation
1. Install Dependencies
pip install -r requirements.txt
2. GPU Support (Optional)
For CUDA 11.8:
pip install torch==2.8.0+cu118 --index-url https://download.pytorch.org/whl/cu118
For CUDA 12.1:
pip install torch==2.8.0+cu121 --index-url https://download.pytorch.org/whl/cu121
3. Audio Loopback Setup
Windows - Option A (Stereo Mix):
- Right-click speaker icon → Sounds → Recording tab
- Right-click → Show Disabled Devices
- Enable and set Stereo Mix as default
Windows - Option B (VB-Cable, recommended):
- Download from vb-audio.com
- Install and restart
- Use --device "CABLE Output"
Linux:
- Configure a PulseAudio loopback, or use transcribe_dual_linux.py
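The Key Options table lists --device as a partial name match. A minimal sketch of what such matching could look like follows; the function name `match_device`, the case-insensitive comparison, and the device names are all assumptions for illustration. In practice the names would probably come from `sounddevice.query_devices()`.

```python
from typing import List, Optional


def match_device(query: str, device_names: List[str]) -> Optional[int]:
    """Return the index of the first device whose name contains `query`.

    Matching is case-insensitive (an assumption); None means nothing matched.
    """
    needle = query.casefold()
    for index, name in enumerate(device_names):
        if needle in name.casefold():
            return index
    return None


# Hypothetical device names, for illustration only.
devices = [
    "Microphone (USB Audio)",
    "CABLE Output (VB-Audio Virtual Cable)",
    "Stereo Mix (Realtek Audio)",
]
print(match_device("cable output", devices))  # 1
```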
4. LLM Features (Optional)
# Install Ollama from ollama.ai
ollama pull llama3.2
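Once the model is pulled, a local Ollama server can be queried over HTTP. The sketch below uses only the standard library and Ollama's default local endpoint, /api/generate on port 11434. The prompt wording and the function names `build_request` and `analyze` are hypothetical; the project's real prompt and client code are not shown in this README.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(text: str, model: str = "llama3.2") -> urllib.request.Request:
    """Build (but do not send) a non-streaming generate request."""
    payload = {
        "model": model,
        # Hypothetical prompt wording, for illustration only.
        "prompt": "Fact-check this statement and suggest two follow-up "
                  f"questions:\n{text}",
        "stream": False,  # ask for one JSON object instead of a stream
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(text: str, model: str = "llama3.2", timeout: float = 60.0) -> str:
    """Send the request to a running Ollama server and return the reply text."""
    with urllib.request.urlopen(build_request(text, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Calling `analyze(...)` requires the Ollama server to be running; `build_request(...)` can be inspected without any network access.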
Usage
Available Scripts
- transcribe_speakers.py - Main script with all features (LLM optional via --enable-llm)
- transcribe_dual_linux.py - Linux-specific with dual audio support
Common Commands
# Quick start with GPU (English)
./RUN_GPU.sh
# Dutch language
./RUN_DUTCH.sh
# Dutch with LLM analysis
./RUN_DUTCH_LLM.sh
# With LLM analysis
./RUN_GPU.sh --enable-llm
# Save to file
./RUN_GPU.sh --output transcript.txt
# Other languages (Spanish, French, German, etc.)
./RUN_GPU.sh --language es # Spanish
./RUN_GPU.sh --language fr # French
./RUN_GPU.sh --language de # German
# Maximum accuracy with LLM and sentence extraction
python transcribe_speakers.py --model large --enable-llm --sentence-mode --output enriched.txt
# Force CPU (if GPU issues)
python transcribe_speakers.py --force-cpu
Key Options
| Option | Description | Default |
|---|---|---|
| --model | Model size: tiny/base/small/medium/large | base |
| --language | Language code (en/es/fr/de/ja/etc.) | en |
| --device | Audio device name (partial match) | Auto |
| --interval | Processing interval (seconds) | 8.0 |
| --min-duration | Minimum audio duration | 3.0 |
| --fast-mode | Fast mode (3-5x faster, lower accuracy) | False |
| --enable-llm | Enable fact-checking and questions | False |
| --llm-model | Ollama model to use | llama3.2 |
| --output | Save to file | None |
| --force-cpu | Disable GPU | False |
| --gpu-index | GPU device index | 0 |
| --sentence-mode | Extract complete sentences from chunks | False |
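For readers who want to see how these options and defaults fit together, the table can be mirrored with argparse roughly as follows. This is a sketch reconstructed from the table alone, not the parser in transcribe_speakers.py; the function name `build_parser` and the value types are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the Key Options table."""
    parser = argparse.ArgumentParser(description="Real-time system audio transcription")
    parser.add_argument("--model", default="base",
                        choices=["tiny", "base", "small", "medium", "large"])
    parser.add_argument("--language", default="en")
    parser.add_argument("--device", default=None)  # None stands in for "Auto"
    parser.add_argument("--interval", type=float, default=8.0)
    parser.add_argument("--min-duration", type=float, default=3.0)
    parser.add_argument("--fast-mode", action="store_true")
    parser.add_argument("--enable-llm", action="store_true")
    parser.add_argument("--llm-model", default="llama3.2")
    parser.add_argument("--output", default=None)
    parser.add_argument("--force-cpu", action="store_true")
    parser.add_argument("--gpu-index", type=int, default=0)
    parser.add_argument("--sentence-mode", action="store_true")
    return parser


args = build_parser().parse_args(["--model", "tiny", "--interval", "2", "--min-duration", "1.5"])
print(args.model, args.interval, args.min_duration, args.enable_llm)  # tiny 2.0 1.5 False
```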
Model Performance
| Model | Size | Speed | Quality | Best For |
|---|---|---|---|---|
| tiny | ~75 MB | Fastest | Basic | Quick tests, low-latency |
| base | ~145 MB | Fast | Good | General real-time use |
| small | ~485 MB | Moderate | Better | Balanced accuracy/speed |
| medium | ~1.5 GB | Slow | Great | High accuracy needs |
| large | ~3 GB | Slowest | Best | Maximum accuracy |
Optimization Presets
Low Latency (Real-Time):
python transcribe_speakers.py --model tiny --fast-mode --interval 2 --min-duration 1.5
Balanced:
python transcribe_speakers.py --model base --interval 5
High Accuracy:
python transcribe_speakers.py --model large --interval 10 --enable-llm
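If the presets are launched from another Python program, they can be kept in one place as argument lists. The sketch below is one possible wrapper; the preset names and the helper `preset_command` are hypothetical, while the flags are copied from the commands above.

```python
import shlex
import sys
from typing import Dict, List, Sequence

# Flags copied from the three presets above; the dictionary keys are made up.
PRESETS: Dict[str, List[str]] = {
    "low-latency": ["--model", "tiny", "--fast-mode", "--interval", "2",
                    "--min-duration", "1.5"],
    "balanced": ["--model", "base", "--interval", "5"],
    "high-accuracy": ["--model", "large", "--interval", "10", "--enable-llm"],
}


def preset_command(name: str, extra: Sequence[str] = ()) -> List[str]:
    """Return the argv list for a preset, with optional extra flags appended."""
    return [sys.executable, "transcribe_speakers.py", *PRESETS[name], *extra]


cmd = preset_command("balanced", ["--language", "es"])
print(shlex.join(cmd[1:]))  # transcribe_speakers.py --model base --interval 5 --language es
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the directory that contains the script.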
Troubleshooting
No loopback device:
- Windows: Enable Stereo Mix or install VB-Cable
- Linux: Configure PulseAudio loopback
CUDA errors:
python transcribe_speakers.py --force-cpu
No audio captured:
- Verify audio is playing
- Check device: --list-devices
- Increase system volume
Poor quality:
- Use larger model: --model medium
- Increase interval: --interval 10
- Specify language: --language <code>
Ollama errors:
- Ensure Ollama is running
- Pull model: ollama pull llama3.2
Output Format
Standard:
[14:23:15] Transcribed audio segment.
[14:23:23] Another segment with timestamp.
With LLM (--enable-llm):
======================================================================
[14:23:15] The Earth revolves around the Sun in 365 days.
📊 Fact Check: FACTUAL (confidence: 0.98)
💡 Scientifically accurate. Earth's orbital period is 365.25 days.
❓ Questions:
1. Why do we need leap years?
2. How does Earth's orbit affect seasons?
======================================================================
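Both layouts are simple enough to reproduce when post-processing transcripts. The formatter below is a sketch that imitates the samples above; the function names, the parameter names, and the 70-character rule width are assumptions, not taken from the project's code.

```python
from datetime import datetime
from typing import Sequence

RULE = "=" * 70  # assumed width of the separator line


def format_segment(text: str, when: datetime) -> str:
    """Standard layout: [HH:MM:SS] followed by the transcribed text."""
    return f"[{when:%H:%M:%S}] {text}"


def format_enriched(text: str, when: datetime, verdict: str, confidence: float,
                    explanation: str, questions: Sequence[str]) -> str:
    """LLM layout: segment line plus fact check, explanation, and questions."""
    lines = [
        RULE,
        format_segment(text, when),
        f"📊 Fact Check: {verdict} (confidence: {confidence:.2f})",
        f"💡 {explanation}",
        "❓ Questions:",
    ]
    lines += [f"{i}. {q}" for i, q in enumerate(questions, start=1)]
    lines.append(RULE)
    return "\n".join(lines)


print(format_segment("Transcribed audio segment.", datetime(2024, 1, 1, 14, 23, 15)))
# [14:23:15] Transcribed audio segment.
```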
Technical Stack
- Audio: sounddevice, soundfile (16kHz mono, 16-bit PCM)
- Transcription: faster-whisper (optimized Whisper)
- LLM: Ollama (local inference)
- Capture: WASAPI loopback (Windows), PulseAudio (Linux)
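To make the 16 kHz mono 16-bit PCM format concrete, the sketch below computes how many bytes one processing interval occupies and shows a plain stereo-to-mono downmix with int16 packing. It uses only the standard library, and the function names are hypothetical. Resampling from a device's native rate is not covered.

```python
import struct
from typing import Iterable, Tuple

SAMPLE_RATE = 16_000   # Hz, from the Technical Stack list
BYTES_PER_SAMPLE = 2   # 16-bit PCM


def chunk_bytes(seconds: float) -> int:
    """Size in bytes of `seconds` of 16 kHz mono 16-bit audio."""
    return int(seconds * SAMPLE_RATE) * BYTES_PER_SAMPLE


def stereo_to_mono_pcm16(frames: Iterable[Tuple[float, float]]) -> bytes:
    """Average left/right floats in [-1.0, 1.0] and pack as little-endian int16."""
    out = bytearray()
    for left, right in frames:
        mono = max(-1.0, min(1.0, (left + right) / 2.0))  # clamp to valid range
        out += struct.pack("<h", int(round(mono * 32767)))
    return bytes(out)


print(chunk_bytes(8.0))  # 256000
```

At the default 8-second interval, each chunk is therefore about 250 KiB of raw audio.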
Future Work
- Real-time streaming transcription with reduced buffering
- Speaker diarization improvements
- Web interface for remote monitoring
- Multi-device simultaneous transcription
- Cloud LLM integration options
- Custom vocabulary and domain adaptation
- Noise reduction preprocessing
License
Uses Whisper (OpenAI), faster-whisper (SYSTRAN), and Ollama.