# Verbatim Dicta
Real-time transcription of system audio using OpenAI's Whisper, with optional LLM-powered analysis. The tool captures system audio through a loopback device and transcribes it with a configurable model size, language, and processing interval.
## Features
- Real-time transcription of system audio (Windows/Linux)
- Multiple Whisper model sizes (tiny to large)
- Multi-language support
- **Sentence extraction mode** - Stitches audio chunks into complete sentences
- Optional LLM analysis for fact-checking and question generation (via Ollama)
- GPU acceleration support
- Flexible audio device configuration
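
Sentence extraction mode can be pictured as a small text buffer that holds chunk transcripts until a sentence terminator arrives. The sketch below is illustrative only: `SentenceStitcher` and its methods are hypothetical names, and the real script may stitch at the audio level or use different rules.

```python
import re

# Hypothetical rule: a sentence ends at ., ! or ? followed by whitespace or end of text.
# Requiring whitespace after the mark keeps decimals such as "3.5" in one piece.
_SENTENCE_END = re.compile(r".+?[.!?](?=\s|$)\s*", re.DOTALL)


class SentenceStitcher:
    """Buffers chunk transcripts and releases only complete sentences."""

    def __init__(self):
        self._buffer = ""

    def feed(self, chunk_text):
        """Add one chunk's text and return the sentences it completed."""
        chunk_text = chunk_text.strip()
        if not chunk_text:
            return []
        self._buffer = f"{self._buffer} {chunk_text}".strip()
        sentences = []
        consumed = 0
        for match in _SENTENCE_END.finditer(self._buffer):
            sentences.append(match.group().strip())
            consumed = match.end()
        # Keep the unfinished tail for the next chunk.
        self._buffer = self._buffer[consumed:]
        return sentences

    def flush(self):
        """Return any unfinished text, for example at shutdown."""
        rest, self._buffer = self._buffer.strip(), ""
        return rest
```

A punctuation-based rule like this splits on abbreviations such as "Dr." and trusts the punctuation Whisper emits at chunk boundaries, so treat it as a starting point rather than a description of the shipped behavior.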
## Quick Start
```bash
# Install dependencies
pip install -r requirements.txt
# Basic transcription (no LLM)
python transcribe_speakers.py
# With LLM analysis (optional)
python transcribe_speakers.py --enable-llm
# With sentence extraction
python transcribe_speakers.py --sentence-mode
# List audio devices
python transcribe_speakers.py --list-devices
```
## Requirements
- **OS**: Windows 10/11 or Linux
- **Python**: 3.8+
- **Audio**: Loopback device (Stereo Mix/VB-Cable on Windows, PulseAudio on Linux)
- **Optional**: CUDA-capable GPU, Ollama for LLM features
## Installation
### 1. Install Dependencies
```bash
pip install -r requirements.txt
```
### 2. GPU Support (Optional)
For CUDA 11.8:
```bash
# Installs the newest torch build published for CUDA 11.8
pip install torch --index-url https://download.pytorch.org/whl/cu118
```
For CUDA 12.1:
```bash
# Installs the newest torch build published for CUDA 12.1
pip install torch --index-url https://download.pytorch.org/whl/cu121
```
### 3. Audio Loopback Setup
**Windows - Option A (Stereo Mix):**
1. Right-click speaker icon → Sounds → Recording tab
2. Right-click → Show Disabled Devices
3. Enable and set Stereo Mix as default

**Windows - Option B (VB-Cable, recommended):**
1. Download from [vb-audio.com](https://vb-audio.com/Cable/)
2. Install and restart
3. Use `--device "CABLE Output"`

**Linux:**
Configure PulseAudio loopback or use `transcribe_dual_linux.py`.
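
The `--device` option takes a partial device name. A minimal sketch of that kind of lookup follows, assuming a case-insensitive substring match over the device dictionaries that `sounddevice.query_devices()` returns (each has `name` and `max_input_channels` keys). `find_input_device` is a hypothetical helper, not a function from this repository.

```python
def find_input_device(devices, query):
    """Return the index of the first input-capable device whose name contains query."""
    needle = query.casefold()
    for index, device in enumerate(devices):
        # Skip output-only devices: they cannot be recorded from.
        if device.get("max_input_channels", 0) <= 0:
            continue
        if needle in device["name"].casefold():
            return index
    return None


# With the real library this would be called roughly as:
#   import sounddevice as sd
#   index = find_input_device(sd.query_devices(), "CABLE Output")
```

Returning `None` when nothing matches lets the caller fall back to automatic selection or print the device list.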
### 4. LLM Features (Optional)
```bash
# Install Ollama from ollama.ai
ollama pull llama3.2
```
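
To check that Ollama is reachable before enabling LLM analysis, a request can be sent to its local REST API. The sketch below assumes Ollama's default address (`http://localhost:11434`) and its `/api/generate` endpoint. The helper names are hypothetical, and the prompts the script actually sends are not shown here.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(prompt, model="llama3.2"):
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_ollama(prompt, model="llama3.2", timeout=60):
    """Send the prompt and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]
```

If `ask_ollama("Say hello")` raises a connection error, the Ollama service is most likely not running or the model has not been pulled yet.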
## Usage
### Available Scripts
- `transcribe_speakers.py` - Main script with all features (LLM optional via `--enable-llm`)
- `transcribe_dual_linux.py` - Linux-specific with dual audio support
### Common Commands
```bash
# Quick start with GPU (English)
./RUN_GPU.sh
# Dutch language
./RUN_DUTCH.sh
# Dutch with LLM analysis
./RUN_DUTCH_LLM.sh
# With LLM analysis
./RUN_GPU.sh --enable-llm
# Save to file
./RUN_GPU.sh --output transcript.txt
# Other languages (Spanish, French, German, etc.)
./RUN_GPU.sh --language es # Spanish
./RUN_GPU.sh --language fr # French
./RUN_GPU.sh --language de # German
# Maximum accuracy with LLM and sentence extraction
python transcribe_speakers.py --model large --enable-llm --sentence-mode --output enriched.txt
# Force CPU (if GPU issues)
python transcribe_speakers.py --force-cpu
```
### Key Options
| Option | Description | Default |
|--------|-------------|---------|
| `--model` | Model size: tiny/base/small/medium/large | base |
| `--language` | Language code (en/es/fr/de/ja/etc.) | en |
| `--device` | Audio device name (partial match) | Auto |
| `--interval` | Processing interval (seconds) | 8.0 |
| `--min-duration` | Minimum audio duration (seconds) | 3.0 |
| `--fast-mode` | Fast mode (3-5x faster, lower accuracy) | False |
| `--enable-llm` | Enable fact-checking and questions | False |
| `--llm-model` | Ollama model to use | llama3.2 |
| `--output` | Save to file | None |
| `--force-cpu` | Disable GPU | False |
| `--gpu-index` | GPU device index | 0 |
| `--sentence-mode` | Extract complete sentences from chunks | False |
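
For reference, the table above maps naturally onto an `argparse` definition. This is a reconstruction from the documented flags and defaults, not the parser in `transcribe_speakers.py`, which may differ in types, help text, or extra options.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented options and defaults."""
    parser = argparse.ArgumentParser(description="Real-time system audio transcription")
    parser.add_argument("--model", default="base",
                        choices=["tiny", "base", "small", "medium", "large"])
    parser.add_argument("--language", default="en")
    parser.add_argument("--device", default=None)  # None means auto-detect
    parser.add_argument("--interval", type=float, default=8.0)
    parser.add_argument("--min-duration", type=float, default=3.0)
    parser.add_argument("--fast-mode", action="store_true")
    parser.add_argument("--enable-llm", action="store_true")
    parser.add_argument("--llm-model", default="llama3.2")
    parser.add_argument("--output", default=None)
    parser.add_argument("--force-cpu", action="store_true")
    parser.add_argument("--gpu-index", type=int, default=0)
    parser.add_argument("--sentence-mode", action="store_true")
    parser.add_argument("--list-devices", action="store_true")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Boolean options are plain switches here, so `False` in the Default column simply means the flag is off unless passed.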
## Model Performance
| Model | Size | Speed | Quality | Best For |
|-------|------|-------|---------|----------|
| tiny | ~75 MB | Fastest | Basic | Quick tests, low-latency |
| base | ~145 MB | Fast | Good | General real-time use |
| small | ~485 MB | Moderate | Better | Balanced accuracy/speed |
| medium | ~1.5 GB | Slow | Great | High accuracy needs |
| large | ~3 GB | Slowest | Best | Maximum accuracy |
## Optimization Presets
**Low Latency (Real-Time):**
```bash
python transcribe_speakers.py --model tiny --fast-mode --interval 2 --min-duration 1.5
```
**Balanced:**
```bash
python transcribe_speakers.py --model base --interval 5
```
**High Accuracy:**
```bash
python transcribe_speakers.py --model large --interval 10 --enable-llm
```
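
If the presets are launched from another program, they can be kept as data and expanded into a command line. The sketch below encodes the three presets exactly as listed above. `PRESETS` and `preset_command` are hypothetical names for illustration.

```python
import shlex

# Flag values copied from the presets above; True marks a bare switch.
PRESETS = {
    "low-latency": {"model": "tiny", "fast-mode": True, "interval": 2, "min-duration": 1.5},
    "balanced": {"model": "base", "interval": 5},
    "high-accuracy": {"model": "large", "interval": 10, "enable-llm": True},
}


def preset_command(name, script="transcribe_speakers.py"):
    """Expand a preset name into an argument list suitable for subprocess.run."""
    args = ["python", script]
    for flag, value in PRESETS[name].items():
        if value is True:
            args.append(f"--{flag}")
        else:
            args.extend([f"--{flag}", str(value)])
    return args


if __name__ == "__main__":
    print(shlex.join(preset_command("low-latency")))
```

Passing the returned list to `subprocess.run(..., check=True)` avoids shell quoting issues with device names that contain spaces.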
## Troubleshooting
**No loopback device:**
- Windows: Enable Stereo Mix or install VB-Cable
- Linux: Configure PulseAudio loopback

**CUDA errors:**
```bash
python transcribe_speakers.py --force-cpu
```

**No audio captured:**
- Verify audio is playing
- Check device: `--list-devices`
- Increase system volume

**Poor quality:**
- Use larger model: `--model medium`
- Increase interval: `--interval 10`
- Specify language: `--language <code>`

**Ollama errors:**
- Ensure Ollama is running
- Pull model: `ollama pull llama3.2`
## Output Format
**Standard:**
```
[14:23:15] Transcribed audio segment.
[14:23:23] Another segment with timestamp.
```
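
The standard format is easy to produce and to parse back when post-processing a saved transcript. A minimal sketch, assuming each line is exactly `[HH:MM:SS]` followed by the text. The function names are hypothetical.

```python
import re
from datetime import datetime, time

LINE_RE = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\]\s+(.*)$")


def format_segment(text, when=None):
    """Render one transcript line in the standard format."""
    if when is None:
        when = datetime.now()
    return f"[{when:%H:%M:%S}] {text.strip()}"


def parse_segment(line):
    """Return (time, text) for a standard line, or None for anything else."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    stamp = datetime.strptime(match.group(1), "%H:%M:%S").time()
    return stamp, match.group(2)
```

Lines that do not start with a timestamp, such as the separator rows in LLM output, come back as `None` and can be skipped.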
**With LLM (`--enable-llm`):**
```
======================================================================
[14:23:15] The Earth revolves around the Sun in 365 days.
📊 Fact Check: FACTUAL (confidence: 0.98)
💡 Scientifically accurate. Earth's orbital period is 365.25 days.
❓ Questions:
1. Why do we need leap years?
2. How does Earth's orbit affect seasons?
======================================================================
```
## Technical Stack
- **Audio**: sounddevice, soundfile (16 kHz mono, 16-bit PCM)
- **Transcription**: faster-whisper (optimized Whisper)
- **LLM**: Ollama (local inference)
- **Capture**: WASAPI loopback (Windows), PulseAudio (Linux)
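
The audio format above fixes how much data each processing interval carries. The sketch below works that out and writes a matching WAV using only the standard library. It uses `wave` instead of soundfile purely to stay dependency-free, and the helper names are hypothetical.

```python
import io
import wave

SAMPLE_RATE = 16_000  # 16 kHz
CHANNELS = 1          # mono
SAMPLE_WIDTH = 2      # 16-bit PCM is 2 bytes per sample


def chunk_bytes(seconds):
    """Size in bytes of `seconds` of raw audio in this format."""
    return int(seconds * SAMPLE_RATE) * CHANNELS * SAMPLE_WIDTH


def silent_wav(seconds):
    """Build an in-memory WAV of silence with the documented format."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(bytes(chunk_bytes(seconds)))  # zero bytes are silence
    return buffer.getvalue()


if __name__ == "__main__":
    # Default --interval of 8.0 s: 128,000 samples, 256,000 bytes.
    print(chunk_bytes(8.0))
```

At 32,000 bytes per second, even long intervals stay small in memory. The cost of a larger `--interval` is latency, not buffer size.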
## Future Work
- Real-time streaming transcription with reduced buffering
- Speaker diarization improvements
- Web interface for remote monitoring
- Multi-device simultaneous transcription
- Cloud LLM integration options
- Custom vocabulary and domain adaptation
- Noise reduction preprocessing
## License
This project uses [Whisper](https://github.com/openai/whisper) (OpenAI), [faster-whisper](https://github.com/SYSTRAN/faster-whisper) (SYSTRAN), and [Ollama](https://ollama.ai).