verbatim-dicta/README.md
2025-12-17 22:30:41 +01:00

Verbatim Dicta

Real-time audio transcription using OpenAI's Whisper models, with optional LLM analysis. Captures microphone input and speaker output simultaneously, so both what you say and what you hear end up in the transcript.

Features

  • Dual audio capture - Record microphone and speaker output simultaneously
  • Real-time transcription - Process audio as it's captured with Whisper models
  • LLM analysis - Optional fact-checking and question generation via Ollama
  • Multi-language - Support for 50+ languages
  • File output - Save transcripts with timestamps and analysis
  • GPU acceleration - CUDA support for faster processing

Quick Start

# Install dependencies
pip install -r requirements.txt

# List audio devices
./run_transcribe.sh --list-devices

# Basic transcription
./run_transcribe.sh --model medium --language en

# With LLM analysis and file output
./run_transcribe.sh --model medium --enable-llm --output transcript.txt

Requirements

  • OS: Windows 10/11 or Linux
  • Python: 3.8+
  • Audio: Loopback device (Stereo Mix/VB-Cable on Windows, PulseAudio on Linux)
  • Optional: CUDA-capable GPU, Ollama for LLM features

Installation

1. Install Dependencies

pip install -r requirements.txt

2. GPU Support (Optional)

For CUDA 11.8:

# No version pin: pip resolves the newest torch build published on this CUDA index
pip install torch --index-url https://download.pytorch.org/whl/cu118

For CUDA 12.1:

pip install torch --index-url https://download.pytorch.org/whl/cu121

3. Audio Setup

Linux (PulseAudio/PipeWire):

# List devices to find your monitor device
./run_transcribe.sh --list-devices

# Use with monitor device
./run_transcribe.sh --monitor "alsa_output.monitor"

Windows:

  • Enable "Stereo Mix" in Sound settings, or
  • Install VB-Cable from vb-audio.com
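
Picking the right loopback device out of a long device list is mostly a matter of matching names. The sketch below filters a list of device names for common loopback markers. The hint strings and the function name `find_loopback_candidates` are assumptions based on typical PulseAudio and Windows naming, not identifiers from this project; adjust them to whatever --list-devices prints on your machine.

```python
from typing import Iterable, List

# Substrings that commonly mark a loopback/monitor source. These are assumptions
# based on typical PulseAudio ("<sink>.monitor", "Monitor of ...") and Windows
# ("Stereo Mix", "CABLE Output") naming, not names taken from this project.
LOOPBACK_HINTS = (".monitor", "monitor of", "stereo mix", "cable output")


def find_loopback_candidates(device_names: Iterable[str]) -> List[str]:
    """Return the names that look like loopback devices, preserving order."""
    matches = []
    for name in device_names:
        lowered = name.lower()
        if any(hint in lowered for hint in LOOPBACK_HINTS):
            matches.append(name)
    return matches


if __name__ == "__main__":
    # Hypothetical device list, as might be copied from --list-devices output.
    names = ["USB Mic", "Monitor of Speakers", "alsa_output.monitor", "HDMI Out"]
    print(find_loopback_candidates(names))
```

Any name it returns can be passed to --monitor.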

4. LLM Support (Optional)

# Install Ollama from ollama.ai
ollama pull qwen2.5:3b
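
To check that the pulled model responds before enabling --enable-llm, you can talk to Ollama's local HTTP API directly. This is a minimal sketch using only the standard library. It assumes Ollama is listening on its default address (http://localhost:11434); the prompt wording and the helper names `build_request` and `ask_ollama` are illustrative and not taken from this project.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(text: str, model: str = "qwen2.5:3b") -> urllib.request.Request:
    """Build (but do not send) a non-streaming generate request."""
    # The prompt wording is a placeholder, not the project's actual prompt.
    prompt = f"Fact-check this statement and suggest follow-up questions:\n{text}"
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_ollama(text: str, model: str = "qwen2.5:3b", timeout: float = 30.0) -> str:
    """Send the request and return the generated text from the JSON reply."""
    with urllib.request.urlopen(build_request(text, model), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]
```

Calling `ask_ollama("The Earth orbits the Sun in 365 days.")` should return free-form text if the service is running and the model is pulled.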

Usage

Command Line Options

python transcribe.py [OPTIONS]

Options:
  --model {tiny,base,small,medium,large}  Whisper model (default: tiny)
  --language CODE                         Language code (default: en)
  --mic DEVICE                            Microphone device name
  --monitor DEVICE                        Speaker monitor device name
  --interval SECONDS                      Processing interval (default: 5.0)
  --min-duration SECONDS                  Minimum audio duration (default: 2.0)
  --enable-llm                            Enable LLM analysis
  --llm-model MODEL                       Ollama model (default: qwen2.5:3b)
  --output FILE                           Save transcript to file
  --force-cpu                             Force CPU processing
  --list-devices                          List audio devices
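
For anyone scripting around the tool, the option list above maps naturally onto argparse. The following is a sketch that reproduces the documented flags and defaults; it is not the parser from transcribe.py, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Recreate the documented options and defaults (illustrative only)."""
    parser = argparse.ArgumentParser(prog="transcribe.py")
    parser.add_argument("--model", default="tiny",
                        choices=["tiny", "base", "small", "medium", "large"])
    parser.add_argument("--language", default="en", metavar="CODE")
    parser.add_argument("--mic", metavar="DEVICE")
    parser.add_argument("--monitor", metavar="DEVICE")
    parser.add_argument("--interval", type=float, default=5.0, metavar="SECONDS")
    parser.add_argument("--min-duration", type=float, default=2.0, metavar="SECONDS")
    parser.add_argument("--enable-llm", action="store_true")
    parser.add_argument("--llm-model", default="qwen2.5:3b", metavar="MODEL")
    parser.add_argument("--output", metavar="FILE")
    parser.add_argument("--force-cpu", action="store_true")
    parser.add_argument("--list-devices", action="store_true")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--model", "medium", "--language", "nl"])
    print(args.model, args.language, args.interval)
```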

Examples

# Dutch transcription with LLM
./run_transcribe.sh --model medium --language nl --enable-llm

# High-quality meeting transcription
./run_transcribe.sh --model large --interval 8 --output meeting.txt

# Fast real-time transcription
./run_transcribe.sh --model tiny --interval 3 --min-duration 2

# Specific devices
./run_transcribe.sh --mic "USB Mic" --monitor "Monitor of Speakers"

Model Performance

Model    Size     Speed     Quality   Use Case
tiny     75 MB    Fastest   Basic     Real-time, low latency
base     145 MB   Fast      Good      General use
small    485 MB   Moderate  Better    Balanced
medium   1.5 GB   Slow      Great     High accuracy
large    3 GB     Slowest   Best      Maximum quality
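
If you need to choose a model automatically, for example from a disk or download budget, the sizes in the table can drive a simple lookup. This sketch is not part of the project: the helper name `largest_model_within` is hypothetical, and the sizes are the approximate figures listed above (treating 1 GB as 1000 MB), which describe model size, not peak memory use.

```python
from typing import Optional

# Approximate model sizes in MB, copied from the table above.
MODEL_SIZES_MB = {
    "tiny": 75,
    "base": 145,
    "small": 485,
    "medium": 1500,
    "large": 3000,
}


def largest_model_within(budget_mb: float) -> Optional[str]:
    """Return the largest model whose listed size fits the budget, or None."""
    fitting = [(size, name) for name, size in MODEL_SIZES_MB.items() if size <= budget_mb]
    if not fitting:
        return None
    # max() compares by size first, so the biggest fitting model wins.
    return max(fitting)[1]


if __name__ == "__main__":
    print(largest_model_within(500))
```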

Troubleshooting

No audio devices found:

# List all devices
./run_transcribe.sh --list-devices

# Specify devices explicitly
./run_transcribe.sh --mic "device_name" --monitor "monitor_name"

CUDA errors:

# Force CPU processing
./run_transcribe.sh --force-cpu
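
To see whether a usable GPU is visible from Python at all, a quick check along these lines can help. It is a sketch under the assumption that PyTorch is installed (see GPU Support above); the project may detect CUDA differently, and `pick_device` is a hypothetical helper, not a function from transcribe.py.

```python
def pick_device(force_cpu: bool = False) -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    if force_cpu:
        return "cpu"
    try:
        import torch  # optional dependency; may be missing or fail to load
    except Exception:
        # Missing package or broken CUDA libraries: fall back to CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(pick_device())
```

If this prints cpu on a machine with an NVIDIA GPU, the installed torch build or driver is the likely culprit, and --force-cpu is a reasonable workaround in the meantime.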

Ollama connection failed:

# Start Ollama service
ollama serve

# Pull required model
ollama pull qwen2.5:3b
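
To confirm the service is reachable before starting a long session, you can probe Ollama's HTTP API. A minimal sketch, assuming the default address http://localhost:11434 and using the /api/tags endpoint, which lists locally available models; `ollama_is_up` is a hypothetical helper name.

```python
import urllib.request


def ollama_is_up(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if the Ollama API answers on base_url, False otherwise."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError, refused connections and timeouts all derive from OSError.
        return False


if __name__ == "__main__":
    print("Ollama reachable:", ollama_is_up())
```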

Poor transcription quality:

  • Use a larger model: --model medium or --model large
  • Increase the interval: --interval 10
  • Specify the language explicitly, e.g. --language nl
  • Ensure good audio quality (reduce background noise)

Output Format

Standard Output

🎤 [14:23:15] User speaking into microphone
🔊 [14:23:18] Audio from speakers or system

With LLM Analysis

🎤 [14:23:15] The Earth orbits the Sun in 365 days.
   ✅ FACTUAL (0.98): Scientifically accurate orbital period.
   ❓ Questions:
      1. Why do we need leap years?
      2. How does the elliptical orbit affect seasons?
      3. What factors influence Earth's orbital velocity?

File Output

[14:23:15] MIC: User speaking into microphone
[14:23:18] SPEAKER: Audio from speakers

======================================================================
[14:23:25] MIC: The Earth orbits the Sun in 365 days.

📊 Fact Check: FACTUAL (confidence: 0.98)
💡 Scientifically accurate orbital period.

❓ Questions:
1. Why do we need leap years?
2. How does the elliptical orbit affect seasons?
3. What factors influence Earth's orbital velocity?
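
Because every utterance in the file starts with a bracketed timestamp and a source label, saved transcripts are easy to post-process. The sketch below parses those lines with a regular expression and skips separators and analysis lines. It assumes the layout shown above; `Utterance`, `parse_line`, and `parse_transcript` are hypothetical names, not part of this project.

```python
import re
from typing import List, NamedTuple, Optional

# Matches lines such as "[14:23:15] MIC: some text".
LINE_RE = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\] (MIC|SPEAKER): (.*)$")


class Utterance(NamedTuple):
    time: str
    source: str
    text: str


def parse_line(line: str) -> Optional[Utterance]:
    """Parse one transcript line; return None for anything else."""
    match = LINE_RE.match(line.strip())
    return Utterance(*match.groups()) if match else None


def parse_transcript(text: str) -> List[Utterance]:
    """Collect all utterances, ignoring separators and analysis lines."""
    return [u for u in map(parse_line, text.splitlines()) if u is not None]


if __name__ == "__main__":
    sample = "[14:23:15] MIC: User speaking into microphone\n[14:23:18] SPEAKER: Audio from speakers\n"
    for utterance in parse_transcript(sample):
        print(utterance.source, utterance.time, utterance.text)
```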

Architecture

  • Audio Capture: sounddevice with dual-stream support
  • Transcription: faster-whisper (optimized Whisper implementation)
  • LLM: Ollama for local inference
  • Format: 16 kHz mono, 16-bit PCM
  • Processing: Independent mic/speaker buffers with beam_size=3
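
The audio format above fixes how much data each processing interval holds: 16,000 samples per second at 2 bytes per sample is 32,000 bytes per second, so the default 5-second interval is 160,000 bytes per stream. The sketch below works through that arithmetic and shows one way to convert raw PCM into floating-point samples. It is an illustration only: the helper names are hypothetical, and it assumes little-endian samples and that --min-duration is compared against buffered audio length.

```python
import array
import sys

SAMPLE_RATE = 16_000   # Hz, from the format above
BYTES_PER_SAMPLE = 2   # 16-bit PCM, mono


def duration_seconds(pcm: bytes) -> float:
    """Length of a raw PCM buffer in seconds."""
    return len(pcm) / (SAMPLE_RATE * BYTES_PER_SAMPLE)


def pcm16_to_float(pcm: bytes) -> list:
    """Convert little-endian 16-bit PCM to floats in [-1.0, 1.0)."""
    samples = array.array("h")
    samples.frombytes(pcm)
    if sys.byteorder == "big":
        samples.byteswap()  # input is assumed little-endian
    return [s / 32768.0 for s in samples]


def ready_for_transcription(pcm: bytes, min_duration: float = 2.0) -> bool:
    """Mirror the idea behind --min-duration: skip buffers that are too short."""
    return duration_seconds(pcm) >= min_duration


if __name__ == "__main__":
    five_seconds = bytes(SAMPLE_RATE * BYTES_PER_SAMPLE * 5)  # silence
    print(len(five_seconds), duration_seconds(five_seconds))
```

Keeping one such buffer per source (mic and speaker) is what lets the two streams be transcribed independently.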

Contributing

Contributions welcome! Please open issues or submit pull requests.

License

Uses Whisper (OpenAI), faster-whisper (SYSTRAN), and Ollama.