
mike a53c0e2902 init

2025-12-17 22:11:08 +01:00

6.0 KiB


Verbatim Dicta

Real-time audio transcription using Whisper AI with optional LLM-powered analysis. Captures system audio via loopback and transcribes it with configurable models and processing options.

Features

Real-time transcription of system audio (Windows/Linux)
Multiple Whisper model sizes (tiny to large)
Multi-language support
Sentence extraction mode: stitches audio chunks into complete sentences
Optional LLM analysis for fact-checking and question generation (via Ollama)
GPU acceleration support
Flexible audio device configuration
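
The sentence extraction idea can be sketched roughly as follows. This is a minimal illustration and not the project's actual implementation: the `SentenceStitcher` class and its punctuation-based rule for deciding when a sentence is complete are assumptions made for the example.

```python
import re
from typing import List


class SentenceStitcher:
    """Buffers transcribed chunks and releases only complete sentences.

    Hypothetical helper: a sentence is assumed to end at '.', '!' or '?'.
    """

    # Split on whitespace that follows terminal punctuation.
    _BOUNDARY = re.compile(r"(?<=[.!?])\s+")

    def __init__(self) -> None:
        self._pending = ""  # text carried over from earlier chunks

    def feed(self, chunk: str) -> List[str]:
        """Add one transcribed chunk and return any sentences it completes."""
        text = f"{self._pending} {chunk}".strip()
        parts = self._BOUNDARY.split(text)
        if parts[-1].endswith((".", "!", "?")):
            # Everything in the buffer is complete.
            self._pending = ""
        else:
            # Keep the unfinished tail for the next chunk.
            self._pending = parts.pop()
        return [part for part in parts if part]
```

Feeding "The meeting starts" and then "at noon. Please bring" would release "The meeting starts at noon." on the second call and hold "Please bring" until a later chunk finishes it.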

Quick Start

# Install dependencies
pip install -r requirements.txt

# Basic transcription (no LLM)
python transcribe_speakers.py

# With LLM analysis (optional)
python transcribe_speakers.py --enable-llm

# With sentence extraction
python transcribe_speakers.py --sentence-mode

# List audio devices
python transcribe_speakers.py --list-devices

Requirements

OS: Windows 10/11 or Linux
Python: 3.8+
Audio: Loopback device (Stereo Mix/VB-Cable on Windows, PulseAudio on Linux)
Optional: CUDA-capable GPU, Ollama for LLM features

Installation

1. Install Dependencies

pip install -r requirements.txt

2. GPU Support (Optional)

For CUDA 11.8:

pip install torch --index-url https://download.pytorch.org/whl/cu118

For CUDA 12.1:

pip install torch --index-url https://download.pytorch.org/whl/cu121

3. Audio Loopback Setup

Windows - Option A (Stereo Mix):

Right-click speaker icon → Sounds → Recording tab
Right-click → Show Disabled Devices
Enable and set Stereo Mix as default

Windows - Option B (VB-Cable, recommended):

Download from vb-audio.com
Install and restart
Use --device "CABLE Output"

Linux: Configure PulseAudio loopback or use transcribe_dual_linux.py

4. LLM Features (Optional)

# Install Ollama from ollama.ai
ollama pull llama3.2
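
Once the model is pulled, a request to the local Ollama server could look like the sketch below. It uses Ollama's `/api/generate` endpoint on the default port 11434. The prompt wording and the helper names `build_generate_request` and `ask_ollama` are assumptions for illustration; the project's own prompts may differ.

```python
import json
import urllib.request


def build_generate_request(text, model="llama3.2", base_url="http://localhost:11434"):
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    # The prompt below is illustrative only, not the project's actual prompt.
    prompt = (
        "Fact-check the following statement and suggest two follow-up "
        f"questions:\n{text}"
    )
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_ollama(text, **kwargs):
    """Send the request and return the generated text (needs a running Ollama)."""
    request = build_generate_request(text, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read())["response"]
```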

Usage

Available Scripts

transcribe_speakers.py - Main script with all features (LLM optional via --enable-llm)
transcribe_dual_linux.py - Linux-specific with dual audio support

Common Commands

# Quick start with GPU (English)
./RUN_GPU.sh

# Dutch language
./RUN_DUTCH.sh

# Dutch with LLM analysis
./RUN_DUTCH_LLM.sh

# With LLM analysis
./RUN_GPU.sh --enable-llm

# Save to file
./RUN_GPU.sh --output transcript.txt

# Other languages (Spanish, French, German, etc.)
./RUN_GPU.sh --language es  # Spanish
./RUN_GPU.sh --language fr  # French
./RUN_GPU.sh --language de  # German

# Maximum accuracy with LLM and sentence extraction
python transcribe_speakers.py --model large --enable-llm --sentence-mode --output enriched.txt

# Force CPU (if GPU issues)
python transcribe_speakers.py --force-cpu

Key Options

Option	Description	Default
`--model`	Model size: tiny/base/small/medium/large	base
`--language`	Language code (en/es/fr/de/ja/etc.)	en
`--device`	Audio device name (partial match)	Auto
`--interval`	Processing interval (seconds)	8.0
`--min-duration`	Minimum audio duration (seconds)	3.0
`--fast-mode`	Fast mode (3-5x faster, lower accuracy)	False
`--enable-llm`	Enable fact-checking and questions	False
`--llm-model`	Ollama model to use	llama3.2
`--output`	Save to file	None
`--force-cpu`	Disable GPU	False
`--gpu-index`	GPU device index	0
`--sentence-mode`	Extract complete sentences from chunks	False
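
For reference, the options table maps naturally onto an `argparse` parser. The sketch below mirrors the flags and defaults listed above; it is a reconstruction from the table, not the parser in `transcribe_speakers.py`, and the help strings are paraphrased.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser reconstructed from the options table (illustrative only)."""
    parser = argparse.ArgumentParser(description="Real-time audio transcription")
    parser.add_argument("--model", default="base",
                        choices=["tiny", "base", "small", "medium", "large"],
                        help="Whisper model size")
    parser.add_argument("--language", default="en", help="Language code")
    parser.add_argument("--device", default=None,
                        help="Audio device name (partial match); auto if omitted")
    parser.add_argument("--interval", type=float, default=8.0,
                        help="Processing interval in seconds")
    parser.add_argument("--min-duration", type=float, default=3.0,
                        help="Minimum audio duration in seconds")
    parser.add_argument("--fast-mode", action="store_true",
                        help="Trade accuracy for speed")
    parser.add_argument("--enable-llm", action="store_true",
                        help="Enable fact-checking and questions")
    parser.add_argument("--llm-model", default="llama3.2",
                        help="Ollama model to use")
    parser.add_argument("--output", default=None, help="Save transcript to file")
    parser.add_argument("--force-cpu", action="store_true", help="Disable GPU")
    parser.add_argument("--gpu-index", type=int, default=0, help="GPU device index")
    parser.add_argument("--sentence-mode", action="store_true",
                        help="Extract complete sentences from chunks")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```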

Model Performance

Model	Size	Speed	Quality	Best For
tiny	~75 MB	Fastest	Basic	Quick tests, low-latency
base	~145 MB	Fast	Good	General real-time use
small	~485 MB	Moderate	Better	Balanced accuracy/speed
medium	~1.5 GB	Slow	Great	High accuracy needs
large	~3 GB	Slowest	Best	Maximum accuracy

Optimization Presets

Low Latency (Real-Time):

python transcribe_speakers.py --model tiny --fast-mode --interval 2 --min-duration 1.5

Balanced:

python transcribe_speakers.py --model base --interval 5

High Accuracy:

python transcribe_speakers.py --model large --interval 10 --enable-llm
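
One plausible reading of how `--interval` and `--min-duration` interact is sketched below: audio is processed once per interval, and only if enough of it has been buffered. The function name `ready_to_process` and this exact gating rule are assumptions; the script may combine the two settings differently.

```python
SAMPLE_RATE = 16_000  # Hz; the capture format is 16 kHz mono


def ready_to_process(buffered_samples, elapsed_seconds,
                     interval=8.0, min_duration=3.0):
    """Return True when a buffered chunk should be sent to the transcriber.

    Assumed rule: at least `interval` seconds since the last run AND at
    least `min_duration` seconds of audio in the buffer.
    """
    buffered_seconds = buffered_samples / SAMPLE_RATE
    return elapsed_seconds >= interval and buffered_seconds >= min_duration
```

Under this reading, the low-latency preset (`--interval 2 --min-duration 1.5`) transcribes short chunks often, while the high-accuracy preset gives the model longer stretches of context per call.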

Troubleshooting

No loopback device:

Windows: Enable Stereo Mix or install VB-Cable
Linux: Configure PulseAudio loopback

CUDA errors:

python transcribe_speakers.py --force-cpu
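
The CPU fallback can be pictured as a small decision function. The sketch below is hypothetical: `whisper_device_kwargs` is not part of the project, and the `compute_type` values are assumed defaults. The returned keys (`device`, `device_index`, `compute_type`) match keyword arguments accepted by faster-whisper's `WhisperModel` constructor.

```python
def whisper_device_kwargs(force_cpu: bool, gpu_index: int, cuda_available: bool) -> dict:
    """Choose WhisperModel keyword arguments from the CLI flags.

    Falls back to CPU when --force-cpu is set or no CUDA device is usable.
    """
    if force_cpu or not cuda_available:
        # int8 is an assumed CPU-friendly compute type.
        return {"device": "cpu", "compute_type": "int8"}
    # float16 is an assumed GPU compute type.
    return {"device": "cuda", "device_index": gpu_index, "compute_type": "float16"}
```

The result could then be passed along as `WhisperModel("base", **whisper_device_kwargs(...))`.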

No audio captured:

Verify audio is playing
Check device: --list-devices
Increase system volume
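
When checking devices, it can help to see how a partial name match might resolve. The sketch below assumes a case-insensitive substring match over input-capable devices; `find_input_device` is a hypothetical helper. It takes a list of dictionaries shaped like the entries returned by `sounddevice.query_devices()`, which include `name` and `max_input_channels` keys.

```python
from typing import Dict, List, Optional


def find_input_device(devices: List[Dict], name_fragment: str) -> Optional[int]:
    """Return the index of the first input device whose name contains the fragment."""
    fragment = name_fragment.lower()
    for index, device in enumerate(devices):
        has_input = device.get("max_input_channels", 0) > 0
        if has_input and fragment in device["name"].lower():
            return index
    return None  # no matching capture device
```

A `None` result here would correspond to the "No loopback device" case above.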

Poor quality:

Use larger model: --model medium
Increase interval: --interval 10
Specify language: --language <code>

Ollama errors:

Ensure Ollama is running
Pull model: ollama pull llama3.2
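
To confirm that Ollama is reachable before enabling LLM analysis, a quick check along these lines may help. It queries the `/api/tags` endpoint (which lists locally pulled models) on Ollama's default port; the function name `ollama_is_running` and the two-second timeout are assumptions.

```python
import urllib.request


def ollama_is_running(base_url: str = "http://localhost:11434",
                      timeout: float = 2.0) -> bool:
    """Return True if an Ollama server answers at base_url."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP errors.
        return False


if __name__ == "__main__":
    print("Ollama reachable:", ollama_is_running())
```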

Output Format

Standard:

[14:23:15] Transcribed audio segment.
[14:23:23] Another segment with timestamp.
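
A line in this format can be produced and read back with a few lines of Python. The helper names `format_segment` and `parse_segment` are hypothetical, shown only to make the format concrete for anyone post-processing a saved transcript.

```python
import re
from datetime import datetime

# Matches "[HH:MM:SS] text" as shown in the standard output above.
_LINE = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\] (.*)$")


def format_segment(text: str, when: datetime) -> str:
    """Render one transcribed segment with its wall-clock timestamp."""
    return f"[{when:%H:%M:%S}] {text.strip()}"


def parse_segment(line: str):
    """Split a transcript line into (timestamp, text), or return None if it does not match."""
    match = _LINE.match(line)
    return (match.group(1), match.group(2)) if match else None
```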

With LLM (--enable-llm):

======================================================================
[14:23:15] The Earth revolves around the Sun in 365 days.

📊 Fact Check: FACTUAL (confidence: 0.98)
💡 Scientifically accurate. Earth's orbital period is 365.25 days.

❓ Questions:
1. Why do we need leap years?
2. How does Earth's orbit affect seasons?
======================================================================
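
The enriched block could be assembled from structured LLM results as in the sketch below. The function `format_enriched`, its parameters, and the 70-character rule width are assumptions based on the sample output, not the project's actual code.

```python
from typing import List


def format_enriched(timestamp: str, text: str, verdict: str,
                    confidence: float, note: str, questions: List[str]) -> str:
    """Render a transcript segment together with its fact check and questions."""
    rule = "=" * 70  # assumed separator width
    lines = [
        rule,
        f"[{timestamp}] {text}",
        "",
        f"📊 Fact Check: {verdict} (confidence: {confidence:.2f})",
        f"💡 {note}",
        "",
        "❓ Questions:",
    ]
    lines += [f"{number}. {question}" for number, question in enumerate(questions, start=1)]
    lines.append(rule)
    return "\n".join(lines)
```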

Technical Stack

Audio: sounddevice, soundfile (16 kHz mono, 16-bit PCM)
Transcription: faster-whisper (optimized Whisper)
LLM: Ollama (local inference)
Capture: WASAPI loopback (Windows), PulseAudio (Linux)
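
To make the audio format concrete, the sketch below converts floating-point samples to 16 kHz mono 16-bit PCM and wraps them in a WAV container using only the standard library. The project itself relies on sounddevice and soundfile; `floats_to_wav_bytes` is a hypothetical helper for illustration.

```python
import io
import struct
import wave

SAMPLE_RATE = 16_000  # Hz


def floats_to_wav_bytes(samples) -> bytes:
    """Encode float samples in [-1.0, 1.0] as a 16 kHz mono 16-bit PCM WAV file."""
    # Clip out-of-range values, then scale to signed 16-bit integers.
    clipped = (max(-1.0, min(1.0, sample)) for sample in samples)
    pcm = b"".join(struct.pack("<h", int(sample * 32767)) for sample in clipped)
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 2 bytes = 16 bits per sample
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(pcm)
    return buffer.getvalue()
```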

Future Work

Real-time streaming transcription with reduced buffering
Speaker diarization improvements
Web interface for remote monitoring
Multi-device simultaneous transcription
Cloud LLM integration options
Custom vocabulary and domain adaptation
Noise reduction preprocessing

License

Uses Whisper (OpenAI), faster-whisper (SYSTRAN), and Ollama.
