verbatim-dicta/README.md
2025-12-17 16:33:19 +01:00

Verbatim Dicta

Real-time transcription of system audio using OpenAI's Whisper, with optional LLM-powered analysis. Audio is captured through a loopback device and transcribed with a configurable model size, language, and processing interval.

Features

  • Real-time transcription of system audio (Windows/Linux)
  • Multiple Whisper model sizes (tiny to large)
  • Multi-language support
  • Sentence extraction mode - Stitches audio chunks into complete sentences
  • Optional LLM analysis for fact-checking and question generation (via Ollama)
  • GPU acceleration support
  • Flexible audio device configuration
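As a rough illustration of what sentence extraction involves, the sketch below buffers the text of successive chunks and releases only sentences that end in terminal punctuation. The class name SentenceStitcher and its methods are hypothetical; the actual --sentence-mode logic is not shown in this README and may differ.

```python
import re

# Hypothetical helper, not the script's real implementation.
# Split on whitespace that follows ., ! or ?
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


class SentenceStitcher:
    """Buffers chunk text and releases only complete sentences."""

    def __init__(self):
        self._pending = ""

    def feed(self, chunk_text):
        """Add one transcribed chunk; return the sentences completed by it."""
        self._pending = (self._pending + " " + chunk_text.strip()).strip()
        parts = _SENTENCE_END.split(self._pending)
        # The last part is complete only if it ends in terminal punctuation.
        if parts[-1].endswith((".", "!", "?")):
            complete, self._pending = parts, ""
        else:
            complete, self._pending = parts[:-1], parts[-1]
        return [s for s in complete if s]

    def flush(self):
        """Return whatever is left (for example at shutdown), even if unfinished."""
        rest, self._pending = self._pending, ""
        return [rest] if rest else []
```

A sentence that spans two chunks is held back until the chunk that finishes it arrives, which is why this mode trades a little latency for cleaner output.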

Quick Start

# Install dependencies
pip install -r requirements.txt

# Basic transcription (no LLM)
python transcribe_speakers.py

# With LLM analysis (optional)
python transcribe_speakers.py --enable-llm

# With sentence extraction
python transcribe_speakers.py --sentence-mode

# List audio devices
python transcribe_speakers.py --list-devices

Requirements

  • OS: Windows 10/11 or Linux
  • Python: 3.8+
  • Audio: Loopback device (Stereo Mix/VB-Cable on Windows, PulseAudio on Linux)
  • Optional: CUDA-capable GPU, Ollama for LLM features

Installation

1. Install Dependencies

pip install -r requirements.txt

2. GPU Support (Optional)

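For CUDA 11.8:

pip install torch --index-url https://download.pytorch.org/whl/cu118

For CUDA 12.1:

pip install torch --index-url https://download.pytorch.org/whl/cu121

The index determines which torch versions are available, so the version is left unpinned here. Pin a specific version only after confirming it is published for that CUDA build.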
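To check whether the installed build can actually see a GPU, something like the following can help. It is a minimal sketch: pick_device is a hypothetical helper whose arguments mirror the --force-cpu and --gpu-index options, and it assumes GPU detection goes through torch, which this README does not confirm.

```python
def pick_device(force_cpu=False, gpu_index=0):
    """Return "cpu" or "cuda:<index>" depending on what is usable."""
    if force_cpu:
        return "cpu"
    try:
        import torch  # optional dependency; may be absent on CPU-only installs
    except ImportError:
        return "cpu"
    # Fall back to CPU when CUDA is unavailable or the index is out of range.
    if torch.cuda.is_available() and gpu_index < torch.cuda.device_count():
        return f"cuda:{gpu_index}"
    return "cpu"


print(pick_device())
```

If this prints cpu on a machine with an NVIDIA GPU, the installed torch wheel is most likely a CPU-only build or does not match the driver's CUDA version.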

3. Audio Loopback Setup

Windows - Option A (Stereo Mix):

  1. Right-click speaker icon → Sounds → Recording tab
  2. Right-click → Show Disabled Devices
  3. Enable and set Stereo Mix as default

Windows - Option B (VB-Cable, recommended):

  1. Download from vb-audio.com
  2. Install and restart
  3. Use --device "CABLE Output"

Linux: Configure PulseAudio loopback or use transcribe_dual_linux.py
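The --device option matches on part of the device name. A minimal sketch of that kind of lookup is shown below, assuming device entries shaped like the dictionaries returned by sounddevice.query_devices() (with name and max_input_channels keys). The function find_input_device and the sample list are illustrative, not taken from the script.

```python
def find_input_device(devices, query):
    """Return the index of the first input-capable device whose name contains query."""
    needle = query.lower()
    for index, dev in enumerate(devices):
        # Skip output-only devices: a loopback source must expose input channels.
        if dev.get("max_input_channels", 0) > 0 and needle in dev["name"].lower():
            return index
    return None


# Sample data only; real entries come from sounddevice.query_devices().
sample = [
    {"name": "Speakers (Realtek Audio)", "max_input_channels": 0},
    {"name": "CABLE Output (VB-Audio Virtual Cable)", "max_input_channels": 2},
    {"name": "Stereo Mix (Realtek Audio)", "max_input_channels": 2},
]

print(find_input_device(sample, "cable output"))
```

With sounddevice installed, passing list(sounddevice.query_devices()) instead of the sample list runs the same lookup against the real devices.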

4. LLM Features (Optional)

# Install Ollama from ollama.ai
ollama pull llama3.2
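Once the model is pulled, Ollama serves a local HTTP API (by default on port 11434). The sketch below builds a non-streaming request to its /api/generate endpoint using only the standard library. The prompt wording and the function names build_request and ask_ollama are assumptions for illustration; the prompts the script actually sends are not documented here.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(text, model="llama3.2"):
    """Build (but do not send) a non-streaming generate request."""
    # Illustrative prompt only, not the script's actual prompt.
    payload = {
        "model": model,
        "prompt": f"Fact-check this statement and suggest follow-up questions:\n{text}",
        "stream": False,
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_ollama(text, model="llama3.2", timeout=60):
    """Send the request; needs a running Ollama server with the model pulled."""
    with urllib.request.urlopen(build_request(text, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Calling ask_ollama("The Earth revolves around the Sun.") is a quick way to confirm the server is reachable before running with --enable-llm.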

Usage

Available Scripts

  • transcribe_speakers.py - Main script with all features (LLM optional via --enable-llm)
  • transcribe_dual_linux.py - Linux-specific with dual audio support

Common Commands

# Specify device and model
python transcribe_speakers.py --device "CABLE Output" --model medium

# Save to file with language
python transcribe_speakers.py --language es --output transcript.txt

# Fast mode (low latency)
python transcribe_speakers.py --fast-mode --model tiny --interval 3

# Extract complete sentences from chunks
python transcribe_speakers.py --sentence-mode --output sentences.txt

# Maximum accuracy with LLM and sentence extraction
python transcribe_speakers.py --model large --enable-llm --sentence-mode --output enriched.txt

# Force CPU (avoid GPU issues)
python transcribe_speakers.py --force-cpu

Key Options

| Option | Description | Default |
|---|---|---|
| --model | Model size: tiny/base/small/medium/large | base |
| --language | Language code (en/es/fr/de/ja/etc.) | en |
| --device | Audio device name (partial match) | Auto |
| --interval | Processing interval (seconds) | 8.0 |
| --min-duration | Minimum audio duration (seconds) | 3.0 |
| --fast-mode | Fast mode (3-5x faster, lower accuracy) | False |
| --enable-llm | Enable fact-checking and questions | False |
| --llm-model | Ollama model to use | llama3.2 |
| --output | Save to file | None |
| --force-cpu | Disable GPU | False |
| --gpu-index | GPU device index | 0 |
| --sentence-mode | Extract complete sentences from chunks | False |
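For reference, the options above map naturally onto an argparse parser. The sketch below mirrors the documented names and defaults; it is a reconstruction from the table, not the parser in transcribe_speakers.py, and build_parser is a hypothetical name.

```python
import argparse


def build_parser():
    """Parser mirroring the documented options and their defaults."""
    p = argparse.ArgumentParser(description="Transcribe system audio")
    p.add_argument("--model", default="base",
                   choices=["tiny", "base", "small", "medium", "large"])
    p.add_argument("--language", default="en")
    p.add_argument("--device", default=None)  # None stands for auto-detection
    p.add_argument("--interval", type=float, default=8.0)
    p.add_argument("--min-duration", type=float, default=3.0)
    p.add_argument("--fast-mode", action="store_true")
    p.add_argument("--enable-llm", action="store_true")
    p.add_argument("--llm-model", default="llama3.2")
    p.add_argument("--output", default=None)
    p.add_argument("--force-cpu", action="store_true")
    p.add_argument("--gpu-index", type=int, default=0)
    p.add_argument("--sentence-mode", action="store_true")
    return p


args = build_parser().parse_args(["--model", "tiny", "--fast-mode", "--interval", "3"])
print(args.model, args.fast_mode, args.interval)
```

Boolean options are plain switches (store_true), so they default to False and take no value on the command line.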

Model Performance

| Model | Size | Speed | Quality | Best For |
|---|---|---|---|---|
| tiny | ~75 MB | Fastest | Basic | Quick tests, low-latency |
| base | ~145 MB | Fast | Good | General real-time use |
| small | ~485 MB | Moderate | Better | Balanced accuracy/speed |
| medium | ~1.5 GB | Slow | Great | High accuracy needs |
| large | ~3 GB | Slowest | Best | Maximum accuracy |

Optimization Presets

Low Latency (Real-Time):

python transcribe_speakers.py --model tiny --fast-mode --interval 2 --min-duration 1.5

Balanced:

python transcribe_speakers.py --model base --interval 5

High Accuracy:

python transcribe_speakers.py --model large --interval 10 --enable-llm
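The presets mostly trade latency against context by changing --interval and --min-duration. One plausible way these two values interact is sketched below: audio accumulates until a full interval is available, and a shorter remainder is only kept if it meets the minimum duration. This is an assumption about the buffering behaviour, and ChunkBuffer is a hypothetical class, not code from the script.

```python
class ChunkBuffer:
    """Accumulates samples and hands back one chunk per interval."""

    def __init__(self, sample_rate=16000, interval=8.0, min_duration=3.0):
        self.interval_samples = int(interval * sample_rate)
        self.min_samples = int(min_duration * sample_rate)
        self._samples = []

    def add(self, samples):
        """Append newly captured samples."""
        self._samples.extend(samples)

    def pop_ready(self):
        """Return a full interval of audio if available, else None."""
        if len(self._samples) < self.interval_samples:
            return None
        chunk = self._samples[:self.interval_samples]
        del self._samples[:self.interval_samples]
        return chunk

    def flush(self):
        """At shutdown, return the remainder only if it meets min_duration."""
        rest, self._samples = self._samples, []
        return rest if len(rest) >= self.min_samples else None
```

Under this model a shorter interval means text appears sooner but each chunk carries less context, which is consistent with the low-latency preset pairing a small interval with the tiny model.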

Troubleshooting

No loopback device:

  • Windows: Enable Stereo Mix or install VB-Cable
  • Linux: Configure PulseAudio loopback

CUDA errors:

python transcribe_speakers.py --force-cpu

No audio captured:

  • Verify audio is playing
  • Check device: --list-devices
  • Increase system volume

Poor quality:

  • Use larger model: --model medium
  • Increase interval: --interval 10
  • Specify language: --language <code>

Ollama errors:

  • Ensure Ollama is running
  • Pull model: ollama pull llama3.2

Output Format

Standard:

[14:23:15] Transcribed audio segment.
[14:23:23] Another segment with timestamp.

With LLM (--enable-llm):

======================================================================
[14:23:15] The Earth revolves around the Sun in 365 days.

📊 Fact Check: FACTUAL (confidence: 0.98)
💡 Scientifically accurate. Earth's orbital period is 365.25 days.

❓ Questions:
1. Why do we need leap years?
2. How does Earth's orbit affect seasons?
======================================================================
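In both formats the transcribed text sits on lines of the form [HH:MM:SS] text, so a saved transcript is easy to post-process. The sketch below formats and parses such lines; format_line and parse_line are hypothetical helpers written against the sample output above.

```python
import re
from datetime import datetime

# Matches lines such as "[14:23:15] Transcribed audio segment."
LINE_RE = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\] (.*)$")


def format_line(text, when):
    """Render one transcript line with an HH:MM:SS timestamp."""
    return f"[{when.strftime('%H:%M:%S')}] {text}"


def parse_line(line):
    """Return (time, text) for a transcript line, or None if it does not match."""
    m = LINE_RE.match(line)
    if not m:
        return None
    return datetime.strptime(m.group(1), "%H:%M:%S").time(), m.group(2)
```

Separator lines and the fact-check or question lines in the LLM format do not match the pattern, so parse_line returns None for them and they can be skipped or handled separately.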

Technical Stack

  • Audio: sounddevice, soundfile (16 kHz mono, 16-bit PCM)
  • Transcription: faster-whisper (optimized Whisper)
  • LLM: Ollama (local inference)
  • Capture: WASAPI loopback (Windows), PulseAudio (Linux)
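To make the audio format concrete, the sketch below writes float samples as 16 kHz, mono, 16-bit PCM WAV data. It uses only the standard library wave module rather than soundfile so that it runs anywhere, and write_pcm16_mono is a hypothetical helper, not part of the project.

```python
import io
import struct
import wave


def write_pcm16_mono(target, samples, sample_rate=16000):
    """Write float samples in [-1.0, 1.0] as 16-bit PCM mono WAV.

    target may be a file path or a writable, seekable binary file object.
    """
    frames = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, s))  # guard against out-of-range input
        frames += struct.pack("<h", int(clipped * 32767))
    with wave.open(target, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes per sample = 16 bits
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))


buf = io.BytesIO()
write_pcm16_mono(buf, [0.0, 0.5, -0.5, 1.0])
```

Whisper models expect 16 kHz mono input, so audio captured at another rate or in stereo has to be resampled and downmixed before reaching this stage.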

Future Work

  • Real-time streaming transcription with reduced buffering
  • Speaker diarization improvements
  • Web interface for remote monitoring
  • Multi-device simultaneous transcription
  • Cloud LLM integration options
  • Custom vocabulary and domain adaptation
  • Noise reduction preprocessing

License

Uses Whisper (OpenAI), faster-whisper (SYSTRAN), and Ollama.