chore: update 6 file(s)

mike
2025-12-17 22:30:41 +01:00
parent a53c0e2902
commit 4343b7a5a2
6 changed files with 1122 additions and 220 deletions


@@ -1,105 +1,156 @@
# Quick Start Guide
## Dutch Language (Nederlands)
## 1. Setup Audio Devices
### Basic Dutch Transcription
```bash
./RUN_DUTCH.sh
# List available audio devices
./run_transcribe.sh --list-devices
```
- ✅ GPU-accelerated (RTX 4060 Ti)
- ✅ Sentence extraction (complete sentences)
- ✅ Base model (good balance of speed and accuracy)
### Dutch with LLM Analysis
Find your:
- **Microphone** - Your input device (e.g., "USB Microphone")
- **Monitor** - Speaker capture device (e.g., "Monitor of Built-in Audio")
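
Device names passed to `--mic` and `--monitor` are matched as case-insensitive substrings, as `_find_device` in `transcribe.py` does. The sketch below illustrates that lookup under one assumption: a hypothetical hard-coded `DEVICES` list stands in for the output of `sounddevice.query_devices()`, and `find_input_device` is an illustrative name, not part of the script.

```python
# Hypothetical device list standing in for sounddevice.query_devices().
DEVICES = [
    {"name": "Built-in Audio Analog Stereo", "max_input_channels": 2},
    {"name": "USB Microphone", "max_input_channels": 1},
    {"name": "Monitor of Built-in Audio", "max_input_channels": 2},
    {"name": "HDMI Output", "max_input_channels": 0},
]


def find_input_device(devices, name_fragment):
    """Return the index of the first input-capable device whose name contains name_fragment."""
    needle = name_fragment.lower()
    for index, dev in enumerate(devices):
        if needle in dev["name"].lower() and dev["max_input_channels"] > 0:
            return index
    raise RuntimeError(f"Device '{name_fragment}' not found")


mic_index = find_input_device(DEVICES, "usb mic")
monitor_index = find_input_device(DEVICES, "monitor")
print(mic_index, monitor_index)
```

Because the first match wins, a short fragment such as "audio" may select a different device than intended; a longer, more specific fragment is safer.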
---
## 2. Basic Usage
### Simple Transcription
```bash
./RUN_DUTCH_LLM.sh
# Auto-detect devices
./run_transcribe.sh --model medium --language en
# Specify devices
./run_transcribe.sh --mic "USB Mic" --monitor "Monitor"
```
- ✅ All features from basic version
- ✅ Fact-checking of statements
- ✅ Automatic question generation
- Uses llama3.2:latest model
### Save to File
### With File Output
```bash
./RUN_DUTCH.sh --output transcript.txt
./RUN_DUTCH_LLM.sh --output enriched.txt
./run_transcribe.sh --model medium --language en --output transcript.txt
```
### With LLM Analysis
```bash
./run_transcribe.sh --model medium --enable-llm --output enriched.txt
```
---
## English Language
## 3. Language Examples
### Basic English Transcription
### Dutch (Nederlands)
```bash
./RUN_GPU.sh
./run_transcribe.sh --model medium --language nl --enable-llm
```
### English with LLM
```bash
./RUN_GPU.sh --enable-llm
```
---
## Other Languages
### Spanish
```bash
./RUN_GPU.sh --language es
./run_transcribe.sh --model medium --language es
```
### French
```bash
./RUN_GPU.sh --language fr
./run_transcribe.sh --model medium --language fr
```
### German
```bash
./RUN_GPU.sh --language de
./run_transcribe.sh --model medium --language de
```
---
## Available Ollama Models
## 4. Model Selection
You have these models installed:
- `llama3.2:latest` (2.0 GB) - **Default** - Fast and accurate
- `llama3:8b` (4.7 GB) - More powerful
- `qwen2.5:3b` (1.9 GB) - Fast alternative
- `qwen2.5:7b` (4.7 GB) - Powerful alternative
- `qwen2.5:0.5b` (397 MB) - Very fast, less accurate
| Model | Speed | Quality | Command |
|--------|----------|---------|----------------------------------|
| tiny | Fastest | Basic | `--model tiny` |
| base | Fast | Good | `--model base` |
| small | Moderate | Better | `--model small` |
| medium | Slow | Great | `--model medium` **(recommended)** |
| large | Slowest | Best | `--model large` |
To use a different model:
---
## 5. Optimization Tips
### High Quality Transcription
```bash
./RUN_DUTCH_LLM.sh --llm-model "llama3:8b"
./run_transcribe.sh --model large --interval 8 --min-duration 4
```
### Fast Real-Time
```bash
./run_transcribe.sh --model tiny --interval 3 --min-duration 2
```
### Best Dutch Transcription (Your Setup)
```bash
./run_transcribe.sh --model medium --interval 8 --min-duration 4 --enable-llm --language nl
```
---
## Tips
## 6. LLM Configuration
### Better Accuracy
Use larger Whisper model (slower):
### Default Model (qwen2.5:3b - Fast)
```bash
./RUN_DUTCH.sh --model medium # or: large
./run_transcribe.sh --enable-llm
```
### Faster Processing
Use smaller model or reduce interval:
### Larger Model (Better Analysis)
```bash
./RUN_DUTCH.sh --model tiny --interval 3
```
# Install model first
ollama pull llama3.2
### Debug LLM Issues
```bash
./RUN_DUTCH_LLM.sh --llm-debug
# Use it
./run_transcribe.sh --enable-llm --llm-model llama3.2
```
---
## Controls
## 7. Output Examples
- **Ctrl+C** to stop transcription
- Speak clearly into your microphone
- Wait ~5 seconds for transcription to appear
- Sentences appear with 📝 emoji
### Console Output
```
🎤 [14:23:15] User speaking via microphone
🔊 [14:23:20] Audio from speakers
🎤 [14:23:25] The Earth orbits the Sun in 365 days.
✅ FACTUAL (0.98): Scientifically accurate.
❓ Questions:
1. Why do we need leap years?
2. How does orbital speed vary?
3. What affects Earth's orbit?
```
### File Output
When `--output` is given, transcripts are appended to that file (e.g., `transcript.txt`) with timestamps and, if LLM analysis is enabled, the analysis.
---
## 8. Controls
- **Ctrl+C** - Stop transcription
- Processing happens every `--interval` seconds (default: 5s)
- Audio is transcribed only once at least `--min-duration` seconds have been buffered (default: 2s)
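
How the interval and the minimum duration interact can be sketched as follows. This is a simplified model rather than the script itself: `ready_sources` is a hypothetical helper, and buffer lengths are expressed in samples at the 16 kHz rate that `transcribe.py` uses.

```python
SAMPLE_RATE = 16000  # samples per second, as used by transcribe.py


def ready_sources(buffer_lengths, min_duration=2.0, sample_rate=SAMPLE_RATE):
    """Return the sources whose buffered audio is at least min_duration seconds long."""
    return [
        source
        for source, n_samples in buffer_lengths.items()
        if n_samples / sample_rate >= min_duration
    ]


# At an interval tick: the mic has 5 s buffered, the monitor only 1 s.
buffers = {"mic": 5 * SAMPLE_RATE, "monitor": 1 * SAMPLE_RATE}
print(ready_sources(buffers))  # only the mic buffer is long enough
```

A buffer that is still too short is not discarded; it keeps growing and is checked again at the next interval.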
---
## Troubleshooting
**No devices found:**
```bash
./run_transcribe.sh --list-devices
```
**Ollama errors:**
```bash
ollama serve
ollama pull qwen2.5:3b
```
**Force CPU (GPU issues):**
```bash
./run_transcribe.sh --force-cpu
```

249
README.md

@@ -1,16 +1,15 @@
# Verbatim Dicta
Real-time audio transcription using Whisper AI with optional LLM-powered analysis. Captures system audio via loopback and transcribes it with configurable models and processing options.
Real-time audio transcription using Whisper AI with optional LLM analysis. Captures microphone and speaker audio simultaneously for comprehensive transcription.
## Features
- Real-time transcription of system audio (Windows/Linux)
- Multiple Whisper model sizes (tiny to large)
- Multi-language support
- **Sentence extraction mode** - Stitches audio chunks into complete sentences
- Optional LLM analysis for fact-checking and question generation (via Ollama)
- GPU acceleration support
- Flexible audio device configuration
- **Dual audio capture** - Record microphone and speaker output simultaneously
- **Real-time transcription** - Process audio as it's captured with Whisper models
- **LLM analysis** - Optional fact-checking and question generation via Ollama
- **Multi-language** - Support for 50+ languages
- **File output** - Save transcripts with timestamps and analysis
- **GPU acceleration** - CUDA support for faster processing
## Quick Start
@@ -18,17 +17,14 @@ Real-time audio transcription using Whisper AI with optional LLM-powered analysi
# Install dependencies
pip install -r requirements.txt
# Basic transcription (no LLM)
python transcribe_speakers.py
# With LLM analysis (optional)
python transcribe_speakers.py --enable-llm
# With sentence extraction
python transcribe_speakers.py --sentence-mode
# List audio devices
python transcribe_speakers.py --list-devices
./run_transcribe.sh --list-devices
# Basic transcription
./run_transcribe.sh --model medium --language en
# With LLM analysis and file output
./run_transcribe.sh --model medium --enable-llm --output transcript.txt
```
## Requirements
@@ -58,172 +54,153 @@ For CUDA 12.1:
pip install torch==2.8.0+cu121 --index-url https://download.pytorch.org/whl/cu121
```
### 3. Audio Loopback Setup
### 3. Audio Setup
**Windows - Option A (Stereo Mix):**
1. Right-click speaker icon → Sounds → Recording tab
2. Right-click → Show Disabled Devices
3. Enable and set Stereo Mix as default
**Linux (PulseAudio/PipeWire):**
```bash
# List devices to find your monitor device
./run_transcribe.sh --list-devices
**Windows - Option B (VB-Cable, recommended):**
1. Download from [vb-audio.com](https://vb-audio.com/Cable/)
2. Install and restart
3. Use `--device "CABLE Output"`
# Use with monitor device
./run_transcribe.sh --monitor "alsa_output.monitor"
```
**Linux:**
Configure PulseAudio loopback or use `transcribe_dual_linux.py`
**Windows:**
- Enable "Stereo Mix" in Sound settings, or
- Install VB-Cable from [vb-audio.com](https://vb-audio.com/Cable/)
### 4. LLM Features (Optional)
### 4. LLM Support (Optional)
```bash
# Install Ollama from ollama.ai
ollama pull llama3.2
ollama pull qwen2.5:3b
```
## Usage
### Available Scripts
- `transcribe_speakers.py` - Main script with all features (LLM optional via `--enable-llm`)
- `transcribe_dual_linux.py` - Linux-specific with dual audio support
### Common Commands
### Command Line Options
```bash
# Quick start with GPU (English)
./RUN_GPU.sh
python transcribe.py [OPTIONS]
# Dutch language
./RUN_DUTCH.sh
# Dutch with LLM analysis
./RUN_DUTCH_LLM.sh
# With LLM analysis
./RUN_GPU.sh --enable-llm
# Save to file
./RUN_GPU.sh --output transcript.txt
# Other languages (Spanish, French, German, etc.)
./RUN_GPU.sh --language es # Spanish
./RUN_GPU.sh --language fr # French
./RUN_GPU.sh --language de # German
# Maximum accuracy with LLM and sentence extraction
python transcribe_speakers.py --model large --enable-llm --sentence-mode --output enriched.txt
# Force CPU (if GPU issues)
python transcribe_speakers.py --force-cpu
Options:
--model {tiny,base,small,medium,large} Whisper model (default: tiny)
--language CODE Language code (default: en)
--mic DEVICE Microphone device name
--monitor DEVICE Speaker monitor device name
--interval SECONDS Processing interval (default: 5.0)
--min-duration SECONDS Minimum audio duration (default: 2.0)
--enable-llm Enable LLM analysis
--llm-model MODEL Ollama model (default: qwen2.5:3b)
--output FILE Save transcript to file
--force-cpu Force CPU processing
--list-devices List audio devices
```
### Key Options
### Examples
| Option | Description | Default |
|--------|-------------|---------|
| `--model` | Model size: tiny/base/small/medium/large | base |
| `--language` | Language code (en/es/fr/de/ja/etc.) | en |
| `--device` | Audio device name (partial match) | Auto |
| `--interval` | Processing interval (seconds) | 8.0 |
| `--min-duration` | Minimum audio duration | 3.0 |
| `--fast-mode` | Fast mode (3-5x faster, lower accuracy) | False |
| `--enable-llm` | Enable fact-checking and questions | False |
| `--llm-model` | Ollama model to use | llama3.2 |
| `--output` | Save to file | None |
| `--force-cpu` | Disable GPU | False |
| `--gpu-index` | GPU device index | 0 |
| `--sentence-mode` | Extract complete sentences from chunks | False |
```bash
# Dutch transcription with LLM
./run_transcribe.sh --model medium --language nl --enable-llm
# High-quality meeting transcription
./run_transcribe.sh --model large --interval 8 --output meeting.txt
# Fast real-time transcription
./run_transcribe.sh --model tiny --interval 3 --min-duration 2
# Specific devices
./run_transcribe.sh --mic "USB Mic" --monitor "Monitor of Speakers"
```
## Model Performance
| Model | Size | Speed | Quality | Best For |
|-------|------|-------|---------|----------|
| tiny | ~75 MB | Fastest | Basic | Quick tests, low-latency |
| base | ~145 MB | Fast | Good | General real-time use |
| small | ~485 MB | Moderate | Better | Balanced accuracy/speed |
| medium | ~1.5 GB | Slow | Great | High accuracy needs |
| large | ~3 GB | Slowest | Best | Maximum accuracy |
## Optimization Presets
**Low Latency (Real-Time):**
```bash
python transcribe_speakers.py --model tiny --fast-mode --interval 2 --min-duration 1.5
```
**Balanced:**
```bash
python transcribe_speakers.py --model base --interval 5
```
**High Accuracy:**
```bash
python transcribe_speakers.py --model large --interval 10 --enable-llm
```
| Model | Size | Speed | Quality | Use Case |
|--------|--------|----------|---------|------------------------|
| tiny | 75 MB | Fastest | Basic | Real-time, low latency |
| base | 145 MB | Fast | Good | General use |
| small | 485 MB | Moderate | Better | Balanced |
| medium | 1.5 GB | Slow | Great | High accuracy |
| large | 3 GB | Slowest | Best | Maximum quality |
## Troubleshooting
**No loopback device:**
- Windows: Enable Stereo Mix or install VB-Cable
- Linux: Configure PulseAudio loopback
**No audio devices found:**
```bash
# List all devices
./run_transcribe.sh --list-devices
# Specify devices explicitly
./run_transcribe.sh --mic "device_name" --monitor "monitor_name"
```
**CUDA errors:**
```bash
python transcribe_speakers.py --force-cpu
# Force CPU processing
./run_transcribe.sh --force-cpu
```
**No audio captured:**
- Verify audio is playing
- Check device: `--list-devices`
- Increase system volume
**Ollama connection failed:**
```bash
# Start Ollama service
ollama serve
**Poor quality:**
- Use larger model: `--model medium`
# Pull required model
ollama pull qwen2.5:3b
```
**Poor transcription quality:**
- Use larger model: `--model medium` or `--model large`
- Increase interval: `--interval 10`
- Specify language: `--language <code>`
**Ollama errors:**
- Ensure Ollama is running
- Pull model: `ollama pull llama3.2`
- Specify language: `--language nl`
- Ensure good audio quality (reduce background noise)
## Output Format
**Standard:**
### Standard Output
```
[14:23:15] Transcribed audio segment.
[14:23:23] Another segment with timestamp.
🎤 [14:23:15] User speaking into microphone
🔊 [14:23:18] Audio from speakers or system
```
**With LLM (--enable-llm):**
### With LLM Analysis
```
🎤 [14:23:15] The Earth orbits the Sun in 365 days.
✅ FACTUAL (0.98): Scientifically accurate orbital period.
❓ Questions:
1. Why do we need leap years?
2. How does the elliptical orbit affect seasons?
3. What factors influence Earth's orbital velocity?
```
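
The verdict line comes from a structured model reply that the script parses with regular expressions. A minimal sketch of that parsing, using the same patterns and fallbacks as `fact_check` in `transcribe.py`; the function name `parse_fact_check` and the sample reply are illustrative assumptions, not actual model output.

```python
import re


def parse_fact_check(reply):
    """Extract verdict, confidence, and reason from a VERDICT/CONFIDENCE/REASON reply."""
    verdict = re.search(r"VERDICT:\s*(\w+)", reply, re.I)
    confidence = re.search(r"CONFIDENCE:\s*([\d.]+)", reply, re.I)
    reason = re.search(r"REASON:\s*(.+?)(?:\n|$)", reply, re.I | re.DOTALL)
    return {
        # Fall back to neutral values when the model ignores the requested format.
        "verdict": verdict.group(1).lower() if verdict else "unknown",
        "confidence": float(confidence.group(1)) if confidence else 0.5,
        "reason": reason.group(1).strip() if reason else reply[:150],
    }


# Hypothetical model reply, for illustration only.
sample = "VERDICT: Factual\nCONFIDENCE: 0.98\nREASON: Scientifically accurate orbital period."
result = parse_fact_check(sample)
print(result)
```

If the reply does not follow the requested format, the verdict becomes `unknown`, the confidence defaults to 0.5, and the first 150 characters of the reply are used as the reason.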
### File Output
```
[14:23:15] MIC: User speaking into microphone
[14:23:18] SPEAKER: Audio from speakers
======================================================================
[14:23:15] The Earth revolves around the Sun in 365 days.
[14:23:25] MIC: The Earth orbits the Sun in 365 days.
📊 Fact Check: FACTUAL (confidence: 0.98)
💡 Scientifically accurate. Earth's orbital period is 365.25 days.
💡 Scientifically accurate orbital period.
❓ Questions:
1. Why do we need leap years?
2. How does Earth's orbit affect seasons?
======================================================================
2. How does the elliptical orbit affect seasons?
3. What factors influence Earth's orbital velocity?
```
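
Because each plain transcript line has the form `[HH:MM:SS] MIC: text` or `[HH:MM:SS] SPEAKER: text`, a saved file can be read back with a small parser. The sketch below is an assumption-laden illustration: `parse_transcript_line` is a hypothetical helper, and analysis lines (fact check, questions, separators) are simply skipped.

```python
import re

LINE_PATTERN = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\] (MIC|SPEAKER): (.*)$")


def parse_transcript_line(line):
    """Return (timestamp, source, text) for a transcript line, or None if it does not match."""
    match = LINE_PATTERN.match(line.rstrip("\n"))
    if match is None:
        return None
    return match.groups()


lines = [
    "[14:23:15] MIC: User speaking into microphone\n",
    "[14:23:18] SPEAKER: Audio from speakers\n",
    "📊 Fact Check: FACTUAL (confidence: 0.98)\n",
]
entries = [parsed for parsed in map(parse_transcript_line, lines) if parsed]
print(entries)
```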
## Technical Stack
## Architecture
- **Audio**: sounddevice, soundfile (16kHz mono, 16-bit PCM)
- **Transcription**: faster-whisper (optimized Whisper)
- **LLM**: Ollama (local inference)
- **Capture**: WASAPI loopback (Windows), PulseAudio (Linux)
- **Audio Capture**: sounddevice with dual-stream support
- **Transcription**: faster-whisper (optimized Whisper implementation)
- **LLM**: Ollama for local inference
- **Format**: 16kHz mono, 16-bit PCM
- **Processing**: Independent mic/speaker buffers; Whisper decoding uses `beam_size=3`
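
The 16-bit PCM samples are scaled to floats before they reach Whisper; `transcribe.py` does this with NumPy (`astype(np.float32) / 32768.0`). The sketch below shows the same arithmetic with the standard library only; `pcm16_to_float` and `duration_seconds` are illustrative names, not functions from the script.

```python
from array import array


def pcm16_to_float(samples):
    """Scale signed 16-bit PCM samples to floats in [-1.0, 1.0)."""
    return [s / 32768.0 for s in samples]


def duration_seconds(n_samples, sample_rate=16000):
    """Length of a mono buffer in seconds."""
    return n_samples / sample_rate


chunk = array("h", [0, 16384, -32768, 32767])  # 'h' = signed 16-bit integers
floats = pcm16_to_float(chunk)
print(floats)
print(duration_seconds(2048))  # one 2048-sample chunk at 16 kHz
```

At 16 kHz, one 2048-sample chunk covers 0.128 seconds, so a 2-second minimum duration corresponds to roughly 16 chunks per source.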
## Future Work
## Contributing
- Real-time streaming transcription with reduced buffering
- Speaker diarization improvements
- Web interface for remote monitoring
- Multi-device simultaneous transcription
- Cloud LLM integration options
- Custom vocabulary and domain adaptation
- Noise reduction preprocessing
Contributions welcome! Please open issues or submit pull requests.
## License


@@ -11,4 +11,4 @@ CUBLAS_PATH=".venv/lib/python3.13/site-packages/nvidia/cublas/lib"
export LD_LIBRARY_PATH="${CUDNN_PATH}:${CUBLAS_PATH}:${LD_LIBRARY_PATH}"
# Run the transcription script with all arguments
python3 transcribe_dual_linux.py "$@"
python3 transcribe.py "$@"

437
transcribe.py Normal file

@@ -0,0 +1,437 @@
#!/usr/bin/env python3
"""
Real-time audio transcription with dual capture and optional LLM analysis.
Supports microphone + speaker monitor, file output, and fact-checking.
"""

import sounddevice as sd
import numpy as np
import threading
import queue
import time
import os
import argparse
from datetime import datetime
from faster_whisper import WhisperModel

try:
    import ollama
    OLLAMA_AVAILABLE = True
except ImportError:
    OLLAMA_AVAILABLE = False


class DualAudioCapture:
    """Capture both microphone and speaker output simultaneously"""

    def __init__(self, mic_device=None, monitor_device=None, sample_rate=16000, chunk_size=2048):
        self.sample_rate = sample_rate
        self.chunk_size = chunk_size
        self.audio_queue = queue.Queue()

        # Find devices
        devices = sd.query_devices()

        # Microphone (default input or specified)
        if mic_device is None:
            self.mic_device = sd.default.device[0]  # Default input
        else:
            self.mic_device = self._find_device(mic_device, input_required=True)

        # Monitor/Loopback (for speaker output)
        if monitor_device:
            self.monitor_device = self._find_device(monitor_device, input_required=True)
        else:
            self.monitor_device = None

        print(f"✓ Microphone: {devices[self.mic_device]['name']} (index {self.mic_device})")
        # Compare against None: device index 0 is valid but falsy
        if self.monitor_device is not None:
            print(f"✓ Monitor: {devices[self.monitor_device]['name']} (index {self.monitor_device})")
        else:
            print("⚠ No monitor device - capturing microphone only")

        # Start streams
        self.mic_stream = sd.InputStream(
            device=self.mic_device,
            channels=1,
            samplerate=sample_rate,
            blocksize=chunk_size,
            dtype='int16',
            callback=self._mic_callback
        )

        if self.monitor_device is not None:
            self.monitor_stream = sd.InputStream(
                device=self.monitor_device,
                channels=1,
                samplerate=sample_rate,
                blocksize=chunk_size,
                dtype='int16',
                callback=self._monitor_callback
            )
        else:
            self.monitor_stream = None

        self.mic_stream.start()
        if self.monitor_stream:
            self.monitor_stream.start()

        print("✓ Audio capture started")

    def _find_device(self, device_name, input_required=True):
        """Find device by name substring"""
        devices = sd.query_devices()
        for i, dev in enumerate(devices):
            if device_name.lower() in dev['name'].lower():
                if not input_required or dev['max_input_channels'] > 0:
                    return i
        raise RuntimeError(f"Device '{device_name}' not found")

    def _mic_callback(self, indata, frames, time_info, status):
        """Microphone audio callback"""
        if status:
            print(f"⚠ Mic status: {status}")
        self.audio_queue.put(('mic', indata.copy()))

    def _monitor_callback(self, indata, frames, time_info, status):
        """Monitor/speaker audio callback"""
        if status:
            print(f"⚠ Monitor status: {status}")
        self.audio_queue.put(('monitor', indata.copy()))

    def read_chunk(self):
        """Read audio data from queue"""
        try:
            return self.audio_queue.get(timeout=0.05)
        except queue.Empty:
            return None

    def close(self):
        """Cleanup resources"""
        self.mic_stream.stop()
        self.mic_stream.close()
        if self.monitor_stream:
            self.monitor_stream.stop()
            self.monitor_stream.close()


class WhisperTranscriber:
    """Process audio with Whisper"""

    def __init__(self, model_name="base", language="en", force_cpu=False):
        print(f"Loading Whisper model '{model_name}'...")

        import torch
        has_cuda = torch.cuda.is_available() and not force_cpu

        device = "cpu"
        compute_type = "int8"

        if has_cuda:
            try:
                import ctranslate2
                if ctranslate2.get_cuda_device_count() > 0:
                    device = "cuda"
                    compute_type = "float16"
                    print(f"✓ Using GPU: {torch.cuda.get_device_name(0)}")
            except Exception as e:
                print(f"⚠ CUDA unavailable: {e}")

        if device == "cpu":
            print("✓ Using CPU")

        model_kwargs = {"device": device, "compute_type": compute_type}
        if device == "cpu":
            model_kwargs["cpu_threads"] = 4

        self.model = WhisperModel(model_name, **model_kwargs)
        self.language = language
        self.mic_buffer = np.array([], dtype=np.float32)
        self.monitor_buffer = np.array([], dtype=np.float32)
        self.lock = threading.Lock()

    def add_audio(self, source, audio_chunk):
        """Add audio to appropriate buffer"""
        with self.lock:
            audio_float = audio_chunk.flatten().astype(np.float32) / 32768.0
            if source == 'mic':
                self.mic_buffer = np.concatenate([self.mic_buffer, audio_float])
            else:
                self.monitor_buffer = np.concatenate([self.monitor_buffer, audio_float])

    def transcribe_chunk(self, min_duration=3.0):
        """Transcribe accumulated audio"""
        with self.lock:
            mic_duration = len(self.mic_buffer) / 16000
            monitor_duration = len(self.monitor_buffer) / 16000

            results = {}

            # Transcribe microphone
            if mic_duration >= min_duration:
                mic_audio = self.mic_buffer.copy()
                self.mic_buffer = np.array([], dtype=np.float32)
                results['mic'] = self._transcribe(mic_audio)

            # Transcribe monitor
            if monitor_duration >= min_duration:
                monitor_audio = self.monitor_buffer.copy()
                self.monitor_buffer = np.array([], dtype=np.float32)
                results['monitor'] = self._transcribe(monitor_audio)

            return results if results else None

    def _transcribe(self, audio):
        """Internal transcription"""
        try:
            segments, _ = self.model.transcribe(
                audio,
                language=self.language,
                beam_size=3,
                vad_filter=True,
                vad_parameters=dict(min_silence_duration_ms=500)
            )
            text = " ".join([seg.text for seg in segments]).strip()
            return text if text else None
        except Exception as e:
            print(f"❌ Transcription error: {e}")
            return None


class LLMAnalyzer:
    """LLM analysis with fact-checking and question generation"""

    def __init__(self, model="qwen2.5:3b"):
        if not OLLAMA_AVAILABLE:
            raise RuntimeError("Ollama not installed: pip install ollama")

        self.model = model
        try:
            ollama.list()
            print(f"✓ Ollama connected: {self.model}")
        except Exception as e:
            raise RuntimeError(f"Ollama not running: {e}")

    def fact_check(self, text):
        """Quick fact-check"""
        prompt = f"""Fact-check this statement. Reply ONLY with:
VERDICT: factual/dubious/false
CONFIDENCE: 0.0-1.0
REASON: one sentence
Statement: "{text}" """

        try:
            response = ollama.generate(
                model=self.model,
                prompt=prompt,
                options={"temperature": 0.1, "num_predict": 80}
            )

            import re
            response_text = response['response']

            verdict = re.search(r'VERDICT:\s*(\w+)', response_text, re.I)
            confidence = re.search(r'CONFIDENCE:\s*([\d.]+)', response_text, re.I)
            reason = re.search(r'REASON:\s*(.+?)(?:\n|$)', response_text, re.I | re.DOTALL)

            return {
                'verdict': verdict.group(1).lower() if verdict else 'unknown',
                'confidence': float(confidence.group(1)) if confidence else 0.5,
                'reason': reason.group(1).strip() if reason else response_text[:150]
            }
        except Exception as e:
            return {'verdict': 'error', 'confidence': 0.0, 'reason': str(e)}

    def generate_questions(self, text):
        """Generate follow-up questions"""
        prompt = f"""Generate 3 insightful questions about this. Reply ONLY with:
Q1: [question]
Q2: [question]
Q3: [question]
Statement: "{text}" """

        try:
            response = ollama.generate(
                model=self.model,
                prompt=prompt,
                options={"temperature": 0.7, "num_predict": 120}
            )

            import re
            response_text = response['response']

            questions = []
            for i in range(1, 4):
                q_match = re.search(rf'Q{i}:\s*(.+?)(?:\n|$)', response_text, re.I)
                if q_match:
                    question = q_match.group(1).strip()
                    if not question.endswith('?'):
                        question += '?'
                    questions.append(question)

            # Fallback defaults
            while len(questions) < 3:
                defaults = ["What are the implications?", "What evidence supports this?", "What's the context?"]
                questions.append(defaults[len(questions)])

            return questions[:3]
        except Exception as e:
            return ["What are the key points?", "What supports this?", "What are the implications?"]


def save_transcript(text, source, timestamp, filename):
    """Append transcript to file"""
    os.makedirs(os.path.dirname(filename) if os.path.dirname(filename) else '.', exist_ok=True)
    with open(filename, "a", encoding="utf-8") as f:
        source_label = "MIC" if source == 'mic' else "SPEAKER"
        f.write(f"[{timestamp}] {source_label}: {text}\n")


def save_enriched_transcript(text, source, timestamp, fact_check, questions, filename):
    """Save enriched transcript with LLM analysis"""
    os.makedirs(os.path.dirname(filename) if os.path.dirname(filename) else '.', exist_ok=True)
    with open(filename, "a", encoding="utf-8") as f:
        source_label = "MIC" if source == 'mic' else "SPEAKER"
        f.write(f"\n{'='*70}\n")
        f.write(f"[{timestamp}] {source_label}: {text}\n\n")

        if fact_check:
            f.write(f"📊 Fact Check: {fact_check['verdict'].upper()} ")
            f.write(f"(confidence: {fact_check['confidence']:.2f})\n")
            f.write(f"💡 {fact_check['reason']}\n\n")

        if questions:
            f.write("❓ Questions:\n")
            for i, q in enumerate(questions, 1):
                f.write(f"{i}. {q}\n")
            f.write("\n")


def main():
    parser = argparse.ArgumentParser(description="Real-time audio transcription with dual capture")
    parser.add_argument("--model", default="tiny", choices=["tiny", "base", "small", "medium", "large"],
                        help="Whisper model (default: tiny)")
    parser.add_argument("--language", default="en", help="Language code (default: en)")
    parser.add_argument("--mic", help="Microphone device name (partial match)")
    parser.add_argument("--monitor", help="Monitor device name for speaker capture")
    parser.add_argument("--interval", type=float, default=5.0, help="Processing interval in seconds (default: 5.0)")
    parser.add_argument("--min-duration", type=float, default=2.0, help="Minimum audio duration (default: 2.0)")
    parser.add_argument("--enable-llm", action="store_true", help="Enable LLM analysis (fact-checking + questions)")
    parser.add_argument("--llm-model", default="qwen2.5:3b", help="Ollama model (default: qwen2.5:3b)")
    parser.add_argument("--output", "-o", help="Save transcript to file")
    parser.add_argument("--list-devices", action="store_true", help="List audio devices and exit")
    parser.add_argument("--force-cpu", action="store_true", help="Force CPU processing")

    args = parser.parse_args()

    if args.list_devices:
        print("\nAvailable audio devices:")
        for i, dev in enumerate(sd.query_devices()):
            in_ch = dev['max_input_channels']
            out_ch = dev['max_output_channels']
            if in_ch > 0:
                print(f" [{i:2d}] {dev['name']:<50} IN:{in_ch} OUT:{out_ch}")
        return

    print("=== Real-Time Audio Transcription ===")
    print(f"Model: {args.model} | Language: {args.language} | Interval: {args.interval}s")
    if args.output:
        print(f"Output: {args.output}")
    if args.enable_llm:
        print(f"LLM Analysis: Enabled ({args.llm_model})")

    # Initialize capture
    try:
        capturer = DualAudioCapture(
            mic_device=args.mic,
            monitor_device=args.monitor,
            sample_rate=16000,
            chunk_size=2048
        )
    except Exception as e:
        print(f"\n❌ Audio Error: {e}")
        print("\nTip: Use --list-devices to see available devices")
        print(" Use --mic and --monitor to specify devices")
        return

    # Initialize transcriber
    try:
        transcriber = WhisperTranscriber(
            model_name=args.model,
            language=args.language,
            force_cpu=args.force_cpu
        )
    except Exception as e:
        print(f"\n❌ Whisper Error: {e}")
        return

    # Initialize LLM analyzer
    llm_analyzer = None
    if args.enable_llm:
        try:
            llm_analyzer = LLMAnalyzer(model=args.llm_model)
        except Exception as e:
            print(f"\n⚠ LLM Error: {e}")
            print("Continuing without LLM analysis...")

    # Main loop
    print(f"\n✅ Started. Press Ctrl+C to stop.\n{'='*60}")
    last_process = time.time()

    try:
        while True:
            # Collect audio
            chunk = capturer.read_chunk()
            if chunk:
                source, audio = chunk
                transcriber.add_audio(source, audio)

            # Process at intervals
            if time.time() - last_process >= args.interval:
                results = transcriber.transcribe_chunk(min_duration=args.min_duration)

                if results:
                    timestamp = datetime.now().strftime("%H:%M:%S")

                    for source, text in results.items():
                        if text:
                            source_emoji = "🎤" if source == 'mic' else "🔊"
                            print(f"\n{source_emoji} [{timestamp}] {text}")

                            # LLM analysis
                            fact_check = None
                            questions = None
                            if llm_analyzer:
                                fact_check = llm_analyzer.fact_check(text)
                                questions = llm_analyzer.generate_questions(text)

                                verdict_emoji = {'factual': '✅', 'dubious': '⚠️', 'false': '❌'}.get(
                                    fact_check['verdict'], '')
                                print(f" {verdict_emoji} {fact_check['verdict'].upper()} "
                                      f"({fact_check['confidence']:.2f}): {fact_check['reason']}")
                                print(f" ❓ Questions:")
                                for i, q in enumerate(questions, 1):
                                    print(f" {i}. {q}")

                            # Save to file
                            if args.output:
                                if llm_analyzer:
                                    save_enriched_transcript(text, source, timestamp, fact_check, questions, args.output)
                                else:
                                    save_transcript(text, source, timestamp, args.output)

                last_process = time.time()

    except KeyboardInterrupt:
        print(f"\n{'='*60}\n🛑 Stopping...")

    capturer.close()
    if args.output and os.path.exists(args.output):
        print(f"\n💾 Transcript saved: {os.path.abspath(args.output)}")
    print("\n✅ Done!")


if __name__ == "__main__":
    main()


@@ -1,7 +1,7 @@
#!/usr/bin/env python3
"""
Real-time transcription with dual audio capture (microphone + speaker monitor).
Linux/PipeWire optimized with Ollama LLM fact-checking.
Real-time audio transcription with dual capture and optional LLM analysis.
Supports microphone + speaker monitor, file output, and fact-checking.
"""
import sounddevice as sd
@@ -9,6 +9,7 @@ import numpy as np
import threading
import queue
import time
import os
import argparse
from datetime import datetime
from faster_whisper import WhisperModel
@@ -197,8 +198,8 @@ class WhisperTranscriber:
return None
class LLMFactChecker:
"""Fast fact-checking with Ollama"""
class LLMAnalyzer:
"""LLM analysis with fact-checking and question generation"""
def __init__(self, model="qwen2.5:3b"):
if not OLLAMA_AVAILABLE:
@@ -228,34 +229,100 @@ Statement: "{text}" """
)
import re
text = response['response']
response_text = response['response']
verdict = re.search(r'VERDICT:\s*(\w+)', text, re.I)
confidence = re.search(r'CONFIDENCE:\s*([\d.]+)', text, re.I)
reason = re.search(r'REASON:\s*(.+?)(?:\n|$)', text, re.I | re.DOTALL)
verdict = re.search(r'VERDICT:\s*(\w+)', response_text, re.I)
confidence = re.search(r'CONFIDENCE:\s*([\d.]+)', response_text, re.I)
reason = re.search(r'REASON:\s*(.+?)(?:\n|$)', response_text, re.I | re.DOTALL)
return {
'verdict': verdict.group(1).lower() if verdict else 'unknown',
'confidence': float(confidence.group(1)) if confidence else 0.5,
'reason': reason.group(1).strip() if reason else text[:150]
'reason': reason.group(1).strip() if reason else response_text[:150]
}
except Exception as e:
return {'verdict': 'error', 'confidence': 0.0, 'reason': str(e)}
def generate_questions(self, text):
"""Generate follow-up questions"""
prompt = f"""Generate 3 insightful questions about this. Reply ONLY with:
Q1: [question]
Q2: [question]
Q3: [question]
Statement: "{text}" """
try:
response = ollama.generate(
model=self.model,
prompt=prompt,
options={"temperature": 0.7, "num_predict": 120}
)
import re
response_text = response['response']
questions = []
for i in range(1, 4):
q_match = re.search(rf'Q{i}:\s*(.+?)(?:\n|$)', response_text, re.I)
if q_match:
question = q_match.group(1).strip()
if not question.endswith('?'):
question += '?'
questions.append(question)
# Fallback defaults
while len(questions) < 3:
defaults = ["What are the implications?", "What evidence supports this?", "What's the context?"]
questions.append(defaults[len(questions)])
return questions[:3]
except Exception as e:
return ["What are the key points?", "What supports this?", "What are the implications?"]
def save_transcript(text, source, timestamp, filename):
"""Append transcript to file"""
os.makedirs(os.path.dirname(filename) if os.path.dirname(filename) else '.', exist_ok=True)
with open(filename, "a", encoding="utf-8") as f:
source_label = "MIC" if source == 'mic' else "SPEAKER"
f.write(f"[{timestamp}] {source_label}: {text}\n")
def save_enriched_transcript(text, source, timestamp, fact_check, questions, filename):
"""Save enriched transcript with LLM analysis"""
os.makedirs(os.path.dirname(filename) if os.path.dirname(filename) else '.', exist_ok=True)
with open(filename, "a", encoding="utf-8") as f:
source_label = "MIC" if source == 'mic' else "SPEAKER"
f.write(f"\n{'='*70}\n")
f.write(f"[{timestamp}] {source_label}: {text}\n\n")
if fact_check:
f.write(f"📊 Fact Check: {fact_check['verdict'].upper()} ")
f.write(f"(confidence: {fact_check['confidence']:.2f})\n")
f.write(f"💡 {fact_check['reason']}\n\n")
if questions:
f.write("❓ Questions:\n")
for i, q in enumerate(questions, 1):
f.write(f"{i}. {q}\n")
f.write("\n")
def main():
    parser = argparse.ArgumentParser(description="Dual audio transcription with fact-checking")
    parser.add_argument("--model", default="tiny", choices=["tiny", "base", "small", "medium"],
                        help="Whisper model (default: tiny for speed)")
    parser.add_argument("--language", default="en", help="Language code")
    parser = argparse.ArgumentParser(description="Real-time audio transcription with dual capture")
    parser.add_argument("--model", default="tiny", choices=["tiny", "base", "small", "medium", "large"],
                        help="Whisper model (default: tiny)")
    parser.add_argument("--language", default="en", help="Language code (default: en)")
    parser.add_argument("--mic", help="Microphone device name (partial match)")
    parser.add_argument("--monitor", help="Monitor device name for speaker capture")
    parser.add_argument("--interval", type=float, default=5.0, help="Processing interval (seconds)")
    parser.add_argument("--min-duration", type=float, default=2.0, help="Min audio duration")
    parser.add_argument("--enable-llm", action="store_true", help="Enable fact-checking")
    parser.add_argument("--llm-model", default="qwen2.5:3b", help="Ollama model")
    parser.add_argument("--list-devices", action="store_true", help="List audio devices")
    parser.add_argument("--force-cpu", action="store_true", help="Force CPU")
    parser.add_argument("--interval", type=float, default=5.0, help="Processing interval in seconds (default: 5.0)")
    parser.add_argument("--min-duration", type=float, default=2.0, help="Minimum audio duration (default: 2.0)")
    parser.add_argument("--enable-llm", action="store_true", help="Enable LLM analysis (fact-checking + questions)")
    parser.add_argument("--llm-model", default="qwen2.5:3b", help="Ollama model (default: qwen2.5:3b)")
    parser.add_argument("--output", "-o", help="Save transcript to file")
    parser.add_argument("--list-devices", action="store_true", help="List audio devices and exit")
    parser.add_argument("--force-cpu", action="store_true", help="Force CPU processing")
    args = parser.parse_args()
@@ -268,8 +335,12 @@ def main():
            print(f" [{i:2d}] {dev['name']:<50} IN:{in_ch} OUT:{out_ch}")
        return
    print("=== Dual Audio Transcription with Fact-Checking ===")
    print("=== Real-Time Audio Transcription ===")
    print(f"Model: {args.model} | Language: {args.language} | Interval: {args.interval}s")
    if args.output:
        print(f"Output: {args.output}")
    if args.enable_llm:
        print(f"LLM Analysis: Enabled ({args.llm_model})")
    # Initialize capture
    try:
@@ -296,14 +367,14 @@ def main():
        print(f"\n❌ Whisper Error: {e}")
        return
    # Initialize fact checker
    fact_checker = None
    # Initialize LLM analyzer
    llm_analyzer = None
    if args.enable_llm:
        try:
            fact_checker = LLMFactChecker(model=args.llm_model)
            llm_analyzer = LLMAnalyzer(model=args.llm_model)
        except Exception as e:
            print(f"\n⚠ LLM Error: {e}")
            print("Continuing without fact-checking...")
            print("Continuing without LLM analysis...")
    # Main loop
    print(f"\n✅ Started. Press Ctrl+C to stop.\n{'='*60}")
@@ -329,10 +400,27 @@ def main():
                            source_emoji = "🎤" if source == 'mic' else "🔊"
                            print(f"\n{source_emoji} [{timestamp}] {text}")
                            if fact_checker:
                                fc = fact_checker.fact_check(text)
                                verdict_emoji = {'factual': '', 'dubious': '⚠️', 'false': ''}.get(fc['verdict'], '')
                                print(f" {verdict_emoji} {fc['verdict'].upper()} ({fc['confidence']:.2f}): {fc['reason']}")
                            # LLM analysis
                            fact_check = None
                            questions = None
                            if llm_analyzer:
                                fact_check = llm_analyzer.fact_check(text)
                                questions = llm_analyzer.generate_questions(text)
                                verdict_emoji = {'factual': '', 'dubious': '⚠️', 'false': ''}.get(
                                    fact_check['verdict'], '')
                                print(f" {verdict_emoji} {fact_check['verdict'].upper()} "
                                      f"({fact_check['confidence']:.2f}): {fact_check['reason']}")
                                print(f" ❓ Questions:")
                                for i, q in enumerate(questions, 1):
                                    print(f" {i}. {q}")
                            # Save to file
                            if args.output:
                                if llm_analyzer:
                                    save_enriched_transcript(text, source, timestamp, fact_check, questions, args.output)
                                else:
                                    save_transcript(text, source, timestamp, args.output)
                last_process = time.time()
@@ -340,6 +428,8 @@ def main():
        print(f"\n{'='*60}\n🛑 Stopping...")
    capturer.close()
    if args.output and os.path.exists(args.output):
        print(f"\n💾 Transcript saved: {os.path.abspath(args.output)}")
    print("\n✅ Done!")
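
The save path in this diff appends one block per utterance and creates the output directory on demand. A minimal standalone sketch of that append-style writer follows; the function name `append_entry` and the one-line entry layout are assumptions made for illustration, not the commit's `save_transcript` implementation.

```python
import os
import tempfile


def append_entry(path, source, timestamp, text):
    """Append one transcript line, creating the parent directory if needed.

    Hypothetical helper: name and entry layout are illustrative only.
    """
    # os.path.dirname() returns '' for a bare filename, so fall back to '.'
    parent = os.path.dirname(path) or "."
    os.makedirs(parent, exist_ok=True)
    label = "MIC" if source == "mic" else "SPEAKER"
    # Mode "a" keeps earlier entries, so repeated runs extend the same file
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"[{timestamp}] {label}: {text}\n")


if __name__ == "__main__":
    out = os.path.join(tempfile.mkdtemp(), "logs", "transcript.txt")
    append_entry(out, "mic", "12:00:00", "hello")
    append_entry(out, "monitor", "12:00:05", "world")
    with open(out, encoding="utf-8") as f:
        print(f.read(), end="")
```

Opening in append mode means an existing transcript is never truncated, which matches the `"a"` mode used by `save_enriched_transcript` above.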


@@ -0,0 +1,347 @@
#!/usr/bin/env python3
"""
Real-time transcription with dual audio capture (microphone + speaker monitor).
Linux/PipeWire optimized with Ollama LLM fact-checking.
"""
import sounddevice as sd
import numpy as np
import threading
import queue
import time
import argparse
from datetime import datetime
from faster_whisper import WhisperModel
try:
    import ollama
    OLLAMA_AVAILABLE = True
except ImportError:
    OLLAMA_AVAILABLE = False
class DualAudioCapture:
    """Capture both microphone and speaker output simultaneously"""

    def __init__(self, mic_device=None, monitor_device=None, sample_rate=16000, chunk_size=2048):
        self.sample_rate = sample_rate
        self.chunk_size = chunk_size
        self.audio_queue = queue.Queue()

        # Find devices
        devices = sd.query_devices()

        # Microphone (default input or specified)
        if mic_device is None:
            self.mic_device = sd.default.device[0]  # Default input
        else:
            self.mic_device = self._find_device(mic_device, input_required=True)

        # Monitor/Loopback (for speaker output)
        if monitor_device:
            self.monitor_device = self._find_device(monitor_device, input_required=True)
        else:
            self.monitor_device = None

        print(f"✓ Microphone: {devices[self.mic_device]['name']} (index {self.mic_device})")
        # Compare against None: device index 0 is valid but falsy
        if self.monitor_device is not None:
            print(f"✓ Monitor: {devices[self.monitor_device]['name']} (index {self.monitor_device})")
        else:
            print("⚠ No monitor device - capturing microphone only")

        # Start streams
        self.mic_stream = sd.InputStream(
            device=self.mic_device,
            channels=1,
            samplerate=sample_rate,
            blocksize=chunk_size,
            dtype='int16',
            callback=self._mic_callback
        )
        if self.monitor_device is not None:
            self.monitor_stream = sd.InputStream(
                device=self.monitor_device,
                channels=1,
                samplerate=sample_rate,
                blocksize=chunk_size,
                dtype='int16',
                callback=self._monitor_callback
            )
        else:
            self.monitor_stream = None

        self.mic_stream.start()
        if self.monitor_stream:
            self.monitor_stream.start()
        print("✓ Audio capture started")

    def _find_device(self, device_name, input_required=True):
        """Find device by name substring"""
        devices = sd.query_devices()
        for i, dev in enumerate(devices):
            if device_name.lower() in dev['name'].lower():
                if not input_required or dev['max_input_channels'] > 0:
                    return i
        raise RuntimeError(f"Device '{device_name}' not found")

    def _mic_callback(self, indata, frames, time_info, status):
        """Microphone audio callback"""
        if status:
            print(f"⚠ Mic status: {status}")
        self.audio_queue.put(('mic', indata.copy()))

    def _monitor_callback(self, indata, frames, time_info, status):
        """Monitor/speaker audio callback"""
        if status:
            print(f"⚠ Monitor status: {status}")
        self.audio_queue.put(('monitor', indata.copy()))

    def read_chunk(self):
        """Read audio data from queue"""
        try:
            return self.audio_queue.get(timeout=0.05)
        except queue.Empty:
            return None

    def close(self):
        """Cleanup resources"""
        self.mic_stream.stop()
        self.mic_stream.close()
        if self.monitor_stream:
            self.monitor_stream.stop()
            self.monitor_stream.close()
class WhisperTranscriber:
    """Process audio with Whisper"""

    def __init__(self, model_name="base", language="en", force_cpu=False):
        print(f"Loading Whisper model '{model_name}'...")
        import torch
        has_cuda = torch.cuda.is_available() and not force_cpu
        device = "cpu"
        compute_type = "int8"
        if has_cuda:
            try:
                import ctranslate2
                if ctranslate2.get_cuda_device_count() > 0:
                    device = "cuda"
                    compute_type = "float16"
                    print(f"✓ Using GPU: {torch.cuda.get_device_name(0)}")
            except Exception as e:
                print(f"⚠ CUDA unavailable: {e}")
        if device == "cpu":
            print("✓ Using CPU")

        model_kwargs = {"device": device, "compute_type": compute_type}
        if device == "cpu":
            model_kwargs["cpu_threads"] = 4
        self.model = WhisperModel(model_name, **model_kwargs)
        self.language = language
        self.mic_buffer = np.array([], dtype=np.float32)
        self.monitor_buffer = np.array([], dtype=np.float32)
        self.lock = threading.Lock()

    def add_audio(self, source, audio_chunk):
        """Add audio to appropriate buffer"""
        with self.lock:
            audio_float = audio_chunk.flatten().astype(np.float32) / 32768.0
            if source == 'mic':
                self.mic_buffer = np.concatenate([self.mic_buffer, audio_float])
            else:
                self.monitor_buffer = np.concatenate([self.monitor_buffer, audio_float])

    def transcribe_chunk(self, min_duration=3.0):
        """Transcribe accumulated audio"""
        with self.lock:
            mic_duration = len(self.mic_buffer) / 16000
            monitor_duration = len(self.monitor_buffer) / 16000
            results = {}
            # Transcribe microphone
            if mic_duration >= min_duration:
                mic_audio = self.mic_buffer.copy()
                self.mic_buffer = np.array([], dtype=np.float32)
                results['mic'] = self._transcribe(mic_audio)
            # Transcribe monitor
            if monitor_duration >= min_duration:
                monitor_audio = self.monitor_buffer.copy()
                self.monitor_buffer = np.array([], dtype=np.float32)
                results['monitor'] = self._transcribe(monitor_audio)
            return results if results else None

    def _transcribe(self, audio):
        """Internal transcription"""
        try:
            segments, _ = self.model.transcribe(
                audio,
                language=self.language,
                beam_size=3,  # Faster than default 5
                vad_filter=True,
                vad_parameters=dict(min_silence_duration_ms=500)
            )
            text = " ".join([seg.text for seg in segments]).strip()
            return text if text else None
        except Exception as e:
            print(f"❌ Transcription error: {e}")
            return None
class LLMFactChecker:
    """Fast fact-checking with Ollama"""

    def __init__(self, model="qwen2.5:3b"):
        if not OLLAMA_AVAILABLE:
            raise RuntimeError("Ollama not installed: pip install ollama")
        self.model = model
        try:
            ollama.list()
            print(f"✓ Ollama connected: {self.model}")
        except Exception as e:
            raise RuntimeError(f"Ollama not running: {e}")

    def fact_check(self, text):
        """Quick fact-check"""
        prompt = f"""Fact-check this statement. Reply ONLY with:
VERDICT: factual/dubious/false
CONFIDENCE: 0.0-1.0
REASON: one sentence
Statement: "{text}" """
        try:
            response = ollama.generate(
                model=self.model,
                prompt=prompt,
                options={"temperature": 0.1, "num_predict": 80}
            )
            import re
            # Keep the model reply separate from the input statement
            reply = response['response']
            verdict = re.search(r'VERDICT:\s*(\w+)', reply, re.I)
            # Match a plain decimal so a trailing period ("0.8.") does not break float()
            confidence = re.search(r'CONFIDENCE:\s*(\d+(?:\.\d+)?)', reply, re.I)
            reason = re.search(r'REASON:\s*(.+?)(?:\n|$)', reply, re.I | re.DOTALL)
            return {
                'verdict': verdict.group(1).lower() if verdict else 'unknown',
                'confidence': float(confidence.group(1)) if confidence else 0.5,
                'reason': reason.group(1).strip() if reason else reply[:150]
            }
        except Exception as e:
            return {'verdict': 'error', 'confidence': 0.0, 'reason': str(e)}
def main():
    parser = argparse.ArgumentParser(description="Dual audio transcription with fact-checking")
    parser.add_argument("--model", default="tiny", choices=["tiny", "base", "small", "medium"],
                        help="Whisper model (default: tiny for speed)")
    parser.add_argument("--language", default="en", help="Language code")
    parser.add_argument("--mic", help="Microphone device name (partial match)")
    parser.add_argument("--monitor", help="Monitor device name for speaker capture")
    parser.add_argument("--interval", type=float, default=5.0, help="Processing interval (seconds)")
    parser.add_argument("--min-duration", type=float, default=2.0, help="Min audio duration")
    parser.add_argument("--enable-llm", action="store_true", help="Enable fact-checking")
    parser.add_argument("--llm-model", default="qwen2.5:3b", help="Ollama model")
    parser.add_argument("--list-devices", action="store_true", help="List audio devices")
    parser.add_argument("--force-cpu", action="store_true", help="Force CPU")
    args = parser.parse_args()

    if args.list_devices:
        print("\nAvailable audio devices:")
        for i, dev in enumerate(sd.query_devices()):
            in_ch = dev['max_input_channels']
            out_ch = dev['max_output_channels']
            if in_ch > 0:
                print(f" [{i:2d}] {dev['name']:<50} IN:{in_ch} OUT:{out_ch}")
        return

    print("=== Dual Audio Transcription with Fact-Checking ===")
    print(f"Model: {args.model} | Language: {args.language} | Interval: {args.interval}s")

    # Initialize capture
    try:
        capturer = DualAudioCapture(
            mic_device=args.mic,
            monitor_device=args.monitor,
            sample_rate=16000,
            chunk_size=2048
        )
    except Exception as e:
        print(f"\n❌ Audio Error: {e}")
        print("\nTip: Use --list-devices to see available devices")
        print(" Use --mic and --monitor to specify devices")
        return

    # Initialize transcriber
    try:
        transcriber = WhisperTranscriber(
            model_name=args.model,
            language=args.language,
            force_cpu=args.force_cpu
        )
    except Exception as e:
        print(f"\n❌ Whisper Error: {e}")
        return

    # Initialize fact checker
    fact_checker = None
    if args.enable_llm:
        try:
            fact_checker = LLMFactChecker(model=args.llm_model)
        except Exception as e:
            print(f"\n⚠ LLM Error: {e}")
            print("Continuing without fact-checking...")

    # Main loop
    print(f"\n✅ Started. Press Ctrl+C to stop.\n{'='*60}")
    last_process = time.time()
    try:
        while True:
            # Collect audio
            chunk = capturer.read_chunk()
            if chunk:
                source, audio = chunk
                transcriber.add_audio(source, audio)
            # Process at intervals
            if time.time() - last_process >= args.interval:
                results = transcriber.transcribe_chunk(min_duration=args.min_duration)
                if results:
                    timestamp = datetime.now().strftime("%H:%M:%S")
                    for source, text in results.items():
                        if text:
                            source_emoji = "🎤" if source == 'mic' else "🔊"
                            print(f"\n{source_emoji} [{timestamp}] {text}")
                            if fact_checker:
                                fc = fact_checker.fact_check(text)
                                verdict_emoji = {'factual': '', 'dubious': '⚠️', 'false': ''}.get(fc['verdict'], '')
                                print(f" {verdict_emoji} {fc['verdict'].upper()} ({fc['confidence']:.2f}): {fc['reason']}")
                last_process = time.time()
    except KeyboardInterrupt:
        print(f"\n{'='*60}\n🛑 Stopping...")
    capturer.close()
    print("\n✅ Done!")


if __name__ == "__main__":
    main()
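
The `fact_check` method above relies on the model replying in a fixed `VERDICT` / `CONFIDENCE` / `REASON` layout and falls back to defaults when a field is missing. The sketch below isolates that parsing step so it can be exercised without a running Ollama server; the function name `parse_fact_check` and the sample replies are assumptions made for illustration, not part of this commit.

```python
import re


def parse_fact_check(reply):
    """Parse a 'VERDICT / CONFIDENCE / REASON' reply into a dict.

    Hypothetical helper mirroring the fallbacks used in fact_check():
    'unknown' verdict, 0.5 confidence, and the first 150 characters of
    the reply when no REASON line is present.
    """
    verdict = re.search(r"VERDICT:\s*(\w+)", reply, re.I)
    # Plain decimal only, so "0.8." parses as 0.8 instead of raising
    confidence = re.search(r"CONFIDENCE:\s*(\d+(?:\.\d+)?)", reply, re.I)
    # Without re.DOTALL, '.' stops at the end of the REASON line
    reason = re.search(r"REASON:\s*(.+)", reply, re.I)
    return {
        "verdict": verdict.group(1).lower() if verdict else "unknown",
        "confidence": float(confidence.group(1)) if confidence else 0.5,
        "reason": reason.group(1).strip() if reason else reply[:150],
    }


if __name__ == "__main__":
    sample = "VERDICT: Dubious\nCONFIDENCE: 0.8.\nREASON: No source is cited.\n"
    print(parse_fact_check(sample))
    print(parse_fact_check("The model ignored the format."))
```

Keeping the parser as a pure function makes the fallback behaviour easy to check against malformed replies, which small local models produce fairly often.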