diff --git a/QUICK_START.md b/QUICK_START.md index b8ee193..f10209f 100644 --- a/QUICK_START.md +++ b/QUICK_START.md @@ -1,105 +1,156 @@ # Quick Start Guide -## Dutch Language (Nederlands) +## 1. Setup Audio Devices -### Basic Dutch Transcription ```bash -./RUN_DUTCH.sh +# List available audio devices +./run_transcribe.sh --list-devices ``` -- ✅ GPU-accelerated (RTX 4060 Ti) -- ✅ Sentence extraction (complete zinnen) -- ✅ Base model (goede balans snelheid/nauwkeurigheid) -### Dutch with LLM Analysis +Find your: +- **Microphone** - Your input device (e.g., "USB Microphone") +- **Monitor** - Speaker capture device (e.g., "Monitor of Built-in Audio") + +--- + +## 2. Basic Usage + +### Simple Transcription ```bash -./RUN_DUTCH_LLM.sh +# Auto-detect devices +./run_transcribe.sh --model medium --language en + +# Specify devices +./run_transcribe.sh --mic "USB Mic" --monitor "Monitor" ``` -- ✅ All features from basic version -- ✅ Fact-checking van uitspraken -- ✅ Automatische vraag generatie -- Uses llama3.2:latest model -### Save to File +### With File Output ```bash -./RUN_DUTCH.sh --output transcript.txt -./RUN_DUTCH_LLM.sh --output enriched.txt +./run_transcribe.sh --model medium --language en --output transcript.txt +``` + +### With LLM Analysis +```bash +./run_transcribe.sh --model medium --enable-llm --output enriched.txt ``` --- -## English Language +## 3. Language Examples -### Basic English Transcription +### Dutch (Nederlands) ```bash -./RUN_GPU.sh +./run_transcribe.sh --model medium --language nl --enable-llm ``` -### English with LLM -```bash -./RUN_GPU.sh --enable-llm -``` - ---- - -## Other Languages - ### Spanish ```bash -./RUN_GPU.sh --language es +./run_transcribe.sh --model medium --language es ``` ### French ```bash -./RUN_GPU.sh --language fr +./run_transcribe.sh --model medium --language fr ``` ### German ```bash -./RUN_GPU.sh --language de +./run_transcribe.sh --model medium --language de ``` --- -## Available Ollama Models +## 4. 
Model Selection -You have these models installed: -- `llama3.2:latest` (2.0 GB) - **Default** - Fast and accurate -- `llama3:8b` (4.7 GB) - More powerful -- `qwen2.5:3b` (1.9 GB) - Fast alternative -- `qwen2.5:7b` (4.7 GB) - Powerful alternative -- `qwen2.5:0.5b` (397 MB) - Very fast, less accurate +| Model | Speed | Quality | Command | +|--------|----------|---------|----------------------------------| +| tiny | Fastest | Basic | `--model tiny` | +| base | Fast | Good | `--model base` | +| small | Moderate | Better | `--model small` | +| medium | Slow | Great | `--model medium` **(recommended)** | +| large | Slowest | Best | `--model large` | -To use a different model: +--- + +## 5. Optimization Tips + +### High Quality Transcription ```bash -./RUN_DUTCH_LLM.sh --llm-model "llama3:8b" +./run_transcribe.sh --model large --interval 8 --min-duration 4 +``` + +### Fast Real-Time +```bash +./run_transcribe.sh --model tiny --interval 3 --min-duration 2 +``` + +### Best Dutch Transcription (Your Setup) +```bash +./run_transcribe.sh --model medium --interval 8 --min-duration 4 --enable-llm --language nl ``` --- -## Tips +## 6. LLM Configuration -### Better Accuracy -Use larger Whisper model (slower): +### Default Model (qwen2.5:3b - Fast) ```bash -./RUN_DUTCH.sh --model medium # or: large +./run_transcribe.sh --enable-llm ``` -### Faster Processing -Use smaller model or reduce interval: +### Larger Model (Better Analysis) ```bash -./RUN_DUTCH.sh --model tiny --interval 3 -``` +# Install model first +ollama pull llama3.2 -### Debug LLM Issues -```bash -./RUN_DUTCH_LLM.sh --llm-debug +# Use it +./run_transcribe.sh --enable-llm --llm-model llama3.2 ``` --- -## Controls +## 7. 
Output Examples -- **Ctrl+C** to stop transcription -- Speak clearly into your microphone -- Wait ~5 seconds for transcription to appear -- Sentences appear with 📝 emoji +### Console Output +``` +🎤 [14:23:15] User speaking via microphone +🔊 [14:23:20] Audio from speakers + +🎤 [14:23:25] The Earth orbits the Sun in 365 days. + ✅ FACTUAL (0.98): Scientifically accurate. + ❓ Questions: + 1. Why do we need leap years? + 2. How does orbital speed vary? + 3. What affects Earth's orbit? +``` + +### File Output +Saved to `transcript.txt` or your specified file with timestamps and analysis. + +--- + +## 8. Controls + +- **Ctrl+C** - Stop transcription +- Processing happens every `--interval` seconds (default: 5s) +- Minimum `--min-duration` audio required (default: 2s) + +--- + +## Troubleshooting + +**No devices found:** +```bash +./run_transcribe.sh --list-devices +``` + +**Ollama errors:** +```bash +ollama serve +ollama pull qwen2.5:3b +``` + +**Force CPU (GPU issues):** +```bash +./run_transcribe.sh --force-cpu +``` diff --git a/README.md b/README.md index 7f2a43b..b244e8c 100644 --- a/README.md +++ b/README.md @@ -1,16 +1,15 @@ # Verbatim Dicta -Real-time audio transcription using Whisper AI with optional LLM-powered analysis. Captures system audio via loopback and transcribes it with configurable models and processing options. +Real-time audio transcription using Whisper AI with optional LLM analysis. Captures microphone and speaker audio simultaneously for comprehensive transcription. 
## Features -- Real-time transcription of system audio (Windows/Linux) -- Multiple Whisper model sizes (tiny to large) -- Multi-language support -- **Sentence extraction mode** - Stitches audio chunks into complete sentences -- Optional LLM analysis for fact-checking and question generation (via Ollama) -- GPU acceleration support -- Flexible audio device configuration +- **Dual audio capture** - Record microphone and speaker output simultaneously +- **Real-time transcription** - Process audio as it's captured with Whisper models +- **LLM analysis** - Optional fact-checking and question generation via Ollama +- **Multi-language** - Support for 50+ languages +- **File output** - Save transcripts with timestamps and analysis +- **GPU acceleration** - CUDA support for faster processing ## Quick Start @@ -18,17 +17,14 @@ Real-time audio transcription using Whisper AI with optional LLM-powered analysi # Install dependencies pip install -r requirements.txt -# Basic transcription (no LLM) -python transcribe_speakers.py - -# With LLM analysis (optional) -python transcribe_speakers.py --enable-llm - -# With sentence extraction -python transcribe_speakers.py --sentence-mode - # List audio devices -python transcribe_speakers.py --list-devices +./run_transcribe.sh --list-devices + +# Basic transcription +./run_transcribe.sh --model medium --language en + +# With LLM analysis and file output +./run_transcribe.sh --model medium --enable-llm --output transcript.txt ``` ## Requirements @@ -58,172 +54,153 @@ For CUDA 12.1: pip install torch==2.8.0+cu121 --index-url https://download.pytorch.org/whl/cu121 ``` -### 3. Audio Loopback Setup +### 3. Audio Setup -**Windows - Option A (Stereo Mix):** -1. Right-click speaker icon → Sounds → Recording tab -2. Right-click → Show Disabled Devices -3. 
Enable and set Stereo Mix as default +**Linux (PulseAudio/PipeWire):** +```bash +# List devices to find your monitor device +./run_transcribe.sh --list-devices -**Windows - Option B (VB-Cable, recommended):** -1. Download from [vb-audio.com](https://vb-audio.com/Cable/) -2. Install and restart -3. Use `--device "CABLE Output"` +# Use with monitor device +./run_transcribe.sh --monitor "alsa_output.monitor" +``` -**Linux:** -Configure PulseAudio loopback or use `transcribe_dual_linux.py` +**Windows:** +- Enable "Stereo Mix" in Sound settings, or +- Install VB-Cable from [vb-audio.com](https://vb-audio.com/Cable/) -### 4. LLM Features (Optional) +### 4. LLM Support (Optional) ```bash # Install Ollama from ollama.ai -ollama pull llama3.2 +ollama pull qwen2.5:3b ``` ## Usage -### Available Scripts - -- `transcribe_speakers.py` - Main script with all features (LLM optional via `--enable-llm`) -- `transcribe_dual_linux.py` - Linux-specific with dual audio support - -### Common Commands +### Command Line Options ```bash -# Quick start with GPU (English) -./RUN_GPU.sh +python transcribe.py [OPTIONS] -# Dutch language -./RUN_DUTCH.sh - -# Dutch with LLM analysis -./RUN_DUTCH_LLM.sh - -# With LLM analysis -./RUN_GPU.sh --enable-llm - -# Save to file -./RUN_GPU.sh --output transcript.txt - -# Other languages (Spanish, French, German, etc.) 
-./RUN_GPU.sh --language es # Spanish -./RUN_GPU.sh --language fr # French -./RUN_GPU.sh --language de # German - -# Maximum accuracy with LLM and sentence extraction -python transcribe_speakers.py --model large --enable-llm --sentence-mode --output enriched.txt - -# Force CPU (if GPU issues) -python transcribe_speakers.py --force-cpu +Options: + --model {tiny,base,small,medium,large} Whisper model (default: tiny) + --language CODE Language code (default: en) + --mic DEVICE Microphone device name + --monitor DEVICE Speaker monitor device name + --interval SECONDS Processing interval (default: 5.0) + --min-duration SECONDS Minimum audio duration (default: 2.0) + --enable-llm Enable LLM analysis + --llm-model MODEL Ollama model (default: qwen2.5:3b) + --output FILE Save transcript to file + --force-cpu Force CPU processing + --list-devices List audio devices ``` -### Key Options +### Examples -| Option | Description | Default | -|--------|-------------|---------| -| `--model` | Model size: tiny/base/small/medium/large | base | -| `--language` | Language code (en/es/fr/de/ja/etc.) 
| en | -| `--device` | Audio device name (partial match) | Auto | -| `--interval` | Processing interval (seconds) | 8.0 | -| `--min-duration` | Minimum audio duration | 3.0 | -| `--fast-mode` | Fast mode (3-5x faster, lower accuracy) | False | -| `--enable-llm` | Enable fact-checking and questions | False | -| `--llm-model` | Ollama model to use | llama3.2 | -| `--output` | Save to file | None | -| `--force-cpu` | Disable GPU | False | -| `--gpu-index` | GPU device index | 0 | -| `--sentence-mode` | Extract complete sentences from chunks | False | +```bash +# Dutch transcription with LLM +./run_transcribe.sh --model medium --language nl --enable-llm + +# High-quality meeting transcription +./run_transcribe.sh --model large --interval 8 --output meeting.txt + +# Fast real-time transcription +./run_transcribe.sh --model tiny --interval 3 --min-duration 2 + +# Specific devices +./run_transcribe.sh --mic "USB Mic" --monitor "Monitor of Speakers" +``` ## Model Performance -| Model | Size | Speed | Quality | Best For | -|-------|------|-------|---------|----------| -| tiny | ~75 MB | Fastest | Basic | Quick tests, low-latency | -| base | ~145 MB | Fast | Good | General real-time use | -| small | ~485 MB | Moderate | Better | Balanced accuracy/speed | -| medium | ~1.5 GB | Slow | Great | High accuracy needs | -| large | ~3 GB | Slowest | Best | Maximum accuracy | - -## Optimization Presets - -**Low Latency (Real-Time):** -```bash -python transcribe_speakers.py --model tiny --fast-mode --interval 2 --min-duration 1.5 -``` - -**Balanced:** -```bash -python transcribe_speakers.py --model base --interval 5 -``` - -**High Accuracy:** -```bash -python transcribe_speakers.py --model large --interval 10 --enable-llm -``` +| Model | Size | Speed | Quality | Use Case | +|--------|--------|----------|---------|------------------------| +| tiny | 75 MB | Fastest | Basic | Real-time, low latency | +| base | 145 MB | Fast | Good | General use | +| small | 485 MB | Moderate | Better | 
Balanced | +| medium | 1.5 GB | Slow | Great | High accuracy | +| large | 3 GB | Slowest | Best | Maximum quality | ## Troubleshooting -**No loopback device:** -- Windows: Enable Stereo Mix or install VB-Cable -- Linux: Configure PulseAudio loopback +**No audio devices found:** +```bash +# List all devices +./run_transcribe.sh --list-devices + +# Specify devices explicitly +./run_transcribe.sh --mic "device_name" --monitor "monitor_name" +``` **CUDA errors:** ```bash -python transcribe_speakers.py --force-cpu +# Force CPU processing +./run_transcribe.sh --force-cpu ``` -**No audio captured:** -- Verify audio is playing -- Check device: `--list-devices` -- Increase system volume +**Ollama connection failed:** +```bash +# Start Ollama service +ollama serve -**Poor quality:** -- Use larger model: `--model medium` +# Pull required model +ollama pull qwen2.5:3b +``` + +**Poor transcription quality:** +- Use larger model: `--model medium` or `--model large` - Increase interval: `--interval 10` -- Specify language: `--language ` - -**Ollama errors:** -- Ensure Ollama is running -- Pull model: `ollama pull llama3.2` +- Specify language: `--language nl` +- Ensure good audio quality (reduce background noise) ## Output Format -**Standard:** +### Standard Output ``` -[14:23:15] Transcribed audio segment. -[14:23:23] Another segment with timestamp. +🎤 [14:23:15] User speaking into microphone +🔊 [14:23:18] Audio from speakers or system ``` -**With LLM (--enable-llm):** +### With LLM Analysis ``` +🎤 [14:23:15] The Earth orbits the Sun in 365 days. + ✅ FACTUAL (0.98): Scientifically accurate orbital period. + ❓ Questions: + 1. Why do we need leap years? + 2. How does the elliptical orbit affect seasons? + 3. What factors influence Earth's orbital velocity? 
+``` + +### File Output +``` +[14:23:15] MIC: User speaking into microphone +[14:23:18] SPEAKER: Audio from speakers + ====================================================================== -[14:23:15] The Earth revolves around the Sun in 365 days. +[14:23:25] MIC: The Earth orbits the Sun in 365 days. 📊 Fact Check: FACTUAL (confidence: 0.98) -💡 Scientifically accurate. Earth's orbital period is 365.25 days. +💡 Scientifically accurate orbital period. ❓ Questions: 1. Why do we need leap years? -2. How does Earth's orbit affect seasons? -====================================================================== +2. How does the elliptical orbit affect seasons? +3. What factors influence Earth's orbital velocity? ``` -## Technical Stack +## Architecture -- **Audio**: sounddevice, soundfile (16kHz mono, 16-bit PCM) -- **Transcription**: faster-whisper (optimized Whisper) -- **LLM**: Ollama (local inference) -- **Capture**: WASAPI loopback (Windows), PulseAudio (Linux) +- **Audio Capture**: sounddevice with dual-stream support +- **Transcription**: faster-whisper (optimized Whisper implementation) +- **LLM**: Ollama for local inference +- **Format**: 16kHz mono, 16-bit PCM +- **Processing**: Independent mic/speaker buffers with beam_size=3 -## Future Work +## Contributing -- Real-time streaming transcription with reduced buffering -- Speaker diarization improvements -- Web interface for remote monitoring -- Multi-device simultaneous transcription -- Cloud LLM integration options -- Custom vocabulary and domain adaptation -- Noise reduction preprocessing +Contributions welcome! Please open issues or submit pull requests. 
## License diff --git a/run_transcribe.sh b/run_transcribe.sh index 36f4e52..b7352a2 100755 --- a/run_transcribe.sh +++ b/run_transcribe.sh @@ -11,4 +11,4 @@ CUBLAS_PATH=".venv/lib/python3.13/site-packages/nvidia/cublas/lib" export LD_LIBRARY_PATH="${CUDNN_PATH}:${CUBLAS_PATH}:${LD_LIBRARY_PATH}" # Run the transcription script with all arguments -python3 transcribe_dual_linux.py "$@" +python3 transcribe.py "$@" diff --git a/transcribe.py b/transcribe.py new file mode 100644 index 0000000..1ca4e1e --- /dev/null +++ b/transcribe.py @@ -0,0 +1,437 @@ +#!/usr/bin/env python3 +""" +Real-time audio transcription with dual capture and optional LLM analysis. +Supports microphone + speaker monitor, file output, and fact-checking. +""" + +import sounddevice as sd +import numpy as np +import threading +import queue +import time +import os +import argparse +from datetime import datetime +from faster_whisper import WhisperModel + +try: + import ollama + OLLAMA_AVAILABLE = True +except ImportError: + OLLAMA_AVAILABLE = False + + +class DualAudioCapture: + """Capture both microphone and speaker output simultaneously""" + + def __init__(self, mic_device=None, monitor_device=None, sample_rate=16000, chunk_size=2048): + self.sample_rate = sample_rate + self.chunk_size = chunk_size + self.audio_queue = queue.Queue() + + # Find devices + devices = sd.query_devices() + + # Microphone (default input or specified) + if mic_device is None: + self.mic_device = sd.default.device[0] # Default input + else: + self.mic_device = self._find_device(mic_device, input_required=True) + + # Monitor/Loopback (for speaker output) + if monitor_device: + self.monitor_device = self._find_device(monitor_device, input_required=True) + else: + self.monitor_device = None + + print(f"✓ Microphone: {devices[self.mic_device]['name']} (index {self.mic_device})") + if self.monitor_device is not None: + print(f"✓ Monitor: {devices[self.monitor_device]['name']} (index {self.monitor_device})") + else: + print("⚠ No monitor 
device - capturing microphone only") + + # Start streams + self.mic_stream = sd.InputStream( + device=self.mic_device, + channels=1, + samplerate=sample_rate, + blocksize=chunk_size, + dtype='int16', + callback=self._mic_callback + ) + + if self.monitor_device is not None: + self.monitor_stream = sd.InputStream( + device=self.monitor_device, + channels=1, + samplerate=sample_rate, + blocksize=chunk_size, + dtype='int16', + callback=self._monitor_callback + ) + else: + self.monitor_stream = None + + self.mic_stream.start() + if self.monitor_stream: + self.monitor_stream.start() + + print("✓ Audio capture started") + + def _find_device(self, device_name, input_required=True): + """Find device by name substring""" + devices = sd.query_devices() + for i, dev in enumerate(devices): + if device_name.lower() in dev['name'].lower(): + if not input_required or dev['max_input_channels'] > 0: + return i + raise RuntimeError(f"Device '{device_name}' not found") + + def _mic_callback(self, indata, frames, time_info, status): + """Microphone audio callback""" + if status: + print(f"⚠ Mic status: {status}") + self.audio_queue.put(('mic', indata.copy())) + + def _monitor_callback(self, indata, frames, time_info, status): + """Monitor/speaker audio callback""" + if status: + print(f"⚠ Monitor status: {status}") + self.audio_queue.put(('monitor', indata.copy())) + + def read_chunk(self): + """Read audio data from queue""" + try: + return self.audio_queue.get(timeout=0.05) + except queue.Empty: + return None + + def close(self): + """Cleanup resources""" + self.mic_stream.stop() + self.mic_stream.close() + if self.monitor_stream: + self.monitor_stream.stop() + self.monitor_stream.close() + + +class WhisperTranscriber: + """Process audio with Whisper""" + + def __init__(self, model_name="base", language="en", force_cpu=False): + print(f"Loading Whisper model '{model_name}'...") + + import torch + has_cuda = torch.cuda.is_available() and not force_cpu + + device = "cpu" + compute_type = 
"int8" + + 
if has_cuda: + try: + import ctranslate2 + if ctranslate2.get_cuda_device_count() > 0: + device = "cuda" + compute_type = "float16" + print(f"✓ Using GPU: {torch.cuda.get_device_name(0)}") + except Exception as e: + print(f"⚠ CUDA unavailable: {e}") + + if device == "cpu": + print("✓ Using CPU") + + model_kwargs = {"device": device, "compute_type": compute_type} + if device == "cpu": + model_kwargs["cpu_threads"] = 4 + + self.model = WhisperModel(model_name, **model_kwargs) + self.language = language + self.mic_buffer = np.array([], dtype=np.float32) + self.monitor_buffer = np.array([], dtype=np.float32) + self.lock = threading.Lock() + + def add_audio(self, source, audio_chunk): + """Add audio to appropriate buffer""" + with self.lock: + audio_float = audio_chunk.flatten().astype(np.float32) / 32768.0 + if source == 'mic': + self.mic_buffer = np.concatenate([self.mic_buffer, audio_float]) + else: + self.monitor_buffer = np.concatenate([self.monitor_buffer, audio_float]) + + def transcribe_chunk(self, min_duration=3.0): + """Transcribe accumulated audio""" + with self.lock: + mic_duration = len(self.mic_buffer) / 16000 + monitor_duration = len(self.monitor_buffer) / 16000 + + results = {} + + # Transcribe microphone + if mic_duration >= min_duration: + mic_audio = self.mic_buffer.copy() + self.mic_buffer = np.array([], dtype=np.float32) + results['mic'] = self._transcribe(mic_audio) + + # Transcribe monitor + if monitor_duration >= min_duration: + monitor_audio = self.monitor_buffer.copy() + self.monitor_buffer = np.array([], dtype=np.float32) + results['monitor'] = self._transcribe(monitor_audio) + + return results if results else None + + def _transcribe(self, audio): + """Internal transcription""" + try: + segments, _ = self.model.transcribe( + audio, + language=self.language, + beam_size=3, + vad_filter=True, + vad_parameters=dict(min_silence_duration_ms=500) + ) + text = " ".join([seg.text for seg in segments]).strip() + return text if text else None + except 
Exception as e: + print(f"❌ Transcription error: {e}") + return None + + +class LLMAnalyzer: + """LLM analysis with fact-checking and question generation""" + + def __init__(self, model="qwen2.5:3b"): + if not OLLAMA_AVAILABLE: + raise RuntimeError("Ollama not installed: pip install ollama") + + self.model = model + try: + ollama.list() + print(f"✓ Ollama connected: {self.model}") + except Exception as e: + raise RuntimeError(f"Ollama not running: {e}") + + def fact_check(self, text): + """Quick fact-check""" + prompt = f"""Fact-check this statement. Reply ONLY with: +VERDICT: factual/dubious/false +CONFIDENCE: 0.0-1.0 +REASON: one sentence + +Statement: "{text}" """ + + try: + response = ollama.generate( + model=self.model, + prompt=prompt, + options={"temperature": 0.1, "num_predict": 80} + ) + + import re + response_text = response['response'] + + verdict = re.search(r'VERDICT:\s*(\w+)', response_text, re.I) + confidence = re.search(r'CONFIDENCE:\s*([\d.]+)', response_text, re.I) + reason = re.search(r'REASON:\s*(.+?)(?:\n|$)', response_text, re.I | re.DOTALL) + + return { + 'verdict': verdict.group(1).lower() if verdict else 'unknown', + 'confidence': float(confidence.group(1)) if confidence else 0.5, + 'reason': reason.group(1).strip() if reason else response_text[:150] + } + except Exception as e: + return {'verdict': 'error', 'confidence': 0.0, 'reason': str(e)} + + def generate_questions(self, text): + """Generate follow-up questions""" + prompt = f"""Generate 3 insightful questions about this. 
Reply ONLY with: +Q1: [question] +Q2: [question] +Q3: [question] + +Statement: "{text}" """ + + try: + response = ollama.generate( + model=self.model, + prompt=prompt, + options={"temperature": 0.7, "num_predict": 120} + ) + + import re + response_text = response['response'] + questions = [] + + for i in range(1, 4): + q_match = re.search(rf'Q{i}:\s*(.+?)(?:\n|$)', response_text, re.I) + if q_match: + question = q_match.group(1).strip() + if not question.endswith('?'): + question += '?' + questions.append(question) + + # Fallback defaults + while len(questions) < 3: + defaults = ["What are the implications?", "What evidence supports this?", "What's the context?"] + questions.append(defaults[len(questions)]) + + return questions[:3] + except Exception as e: + return ["What are the key points?", "What supports this?", "What are the implications?"] + + +def save_transcript(text, source, timestamp, filename): + """Append transcript to file""" + os.makedirs(os.path.dirname(filename) if os.path.dirname(filename) else '.', exist_ok=True) + with open(filename, "a", encoding="utf-8") as f: + source_label = "MIC" if source == 'mic' else "SPEAKER" + f.write(f"[{timestamp}] {source_label}: {text}\n") + + +def save_enriched_transcript(text, source, timestamp, fact_check, questions, filename): + """Save enriched transcript with LLM analysis""" + os.makedirs(os.path.dirname(filename) if os.path.dirname(filename) else '.', exist_ok=True) + with open(filename, "a", encoding="utf-8") as f: + source_label = "MIC" if source == 'mic' else "SPEAKER" + f.write(f"\n{'='*70}\n") + f.write(f"[{timestamp}] {source_label}: {text}\n\n") + + if fact_check: + f.write(f"📊 Fact Check: {fact_check['verdict'].upper()} ") + f.write(f"(confidence: {fact_check['confidence']:.2f})\n") + f.write(f"💡 {fact_check['reason']}\n\n") + + if questions: + f.write("❓ Questions:\n") + for i, q in enumerate(questions, 1): + f.write(f"{i}. 
{q}\n") + f.write("\n") + + +def main(): + parser = argparse.ArgumentParser(description="Real-time audio transcription with dual capture") + parser.add_argument("--model", default="tiny", choices=["tiny", "base", "small", "medium", "large"], + help="Whisper model (default: tiny)") + parser.add_argument("--language", default="en", help="Language code (default: en)") + parser.add_argument("--mic", help="Microphone device name (partial match)") + parser.add_argument("--monitor", help="Monitor device name for speaker capture") + parser.add_argument("--interval", type=float, default=5.0, help="Processing interval in seconds (default: 5.0)") + parser.add_argument("--min-duration", type=float, default=2.0, help="Minimum audio duration (default: 2.0)") + parser.add_argument("--enable-llm", action="store_true", help="Enable LLM analysis (fact-checking + questions)") + parser.add_argument("--llm-model", default="qwen2.5:3b", help="Ollama model (default: qwen2.5:3b)") + parser.add_argument("--output", "-o", help="Save transcript to file") + parser.add_argument("--list-devices", action="store_true", help="List audio devices and exit") + parser.add_argument("--force-cpu", action="store_true", help="Force CPU processing") + + args = parser.parse_args() + + if args.list_devices: + print("\nAvailable audio devices:") + for i, dev in enumerate(sd.query_devices()): + in_ch = dev['max_input_channels'] + out_ch = dev['max_output_channels'] + if in_ch > 0: + print(f" [{i:2d}] {dev['name']:<50} IN:{in_ch} OUT:{out_ch}") + return + + print("=== Real-Time Audio Transcription ===") + print(f"Model: {args.model} | Language: {args.language} | Interval: {args.interval}s") + if args.output: + print(f"Output: {args.output}") + if args.enable_llm: + print(f"LLM Analysis: Enabled ({args.llm_model})") + + # Initialize capture + try: + capturer = DualAudioCapture( + mic_device=args.mic, + monitor_device=args.monitor, + sample_rate=16000, + chunk_size=2048 + ) + except Exception as e: + print(f"\n❌ 
Audio Error: {e}") + print("\nTip: Use --list-devices to see available devices") + print(" Use --mic and --monitor to specify devices") + return + + # Initialize transcriber + try: + transcriber = WhisperTranscriber( + model_name=args.model, + language=args.language, + force_cpu=args.force_cpu + ) + except Exception as e: + print(f"\n❌ Whisper Error: {e}") + return + + # Initialize LLM analyzer + llm_analyzer = None + if args.enable_llm: + try: + llm_analyzer = LLMAnalyzer(model=args.llm_model) + except Exception as e: + print(f"\n⚠ LLM Error: {e}") + print("Continuing without LLM analysis...") + + # Main loop + print(f"\n✅ Started. Press Ctrl+C to stop.\n{'='*60}") + last_process = time.time() + + try: + while True: + # Collect audio + chunk = capturer.read_chunk() + if chunk: + source, audio = chunk + transcriber.add_audio(source, audio) + + # Process at intervals + if time.time() - last_process >= args.interval: + results = transcriber.transcribe_chunk(min_duration=args.min_duration) + + if results: + timestamp = datetime.now().strftime("%H:%M:%S") + + for source, text in results.items(): + if text: + source_emoji = "🎤" if source == 'mic' else "🔊" + print(f"\n{source_emoji} [{timestamp}] {text}") + + # LLM analysis + fact_check = None + questions = None + if llm_analyzer: + fact_check = llm_analyzer.fact_check(text) + questions = llm_analyzer.generate_questions(text) + + verdict_emoji = {'factual': '✅', 'dubious': '⚠️', 'false': '❌'}.get( + fact_check['verdict'], '❓') + print(f" {verdict_emoji} {fact_check['verdict'].upper()} " + f"({fact_check['confidence']:.2f}): {fact_check['reason']}") + print(f" ❓ Questions:") + for i, q in enumerate(questions, 1): + print(f" {i}. 
{q}") + + # Save to file + if args.output: + if llm_analyzer: + save_enriched_transcript(text, source, timestamp, fact_check, questions, args.output) + else: + save_transcript(text, source, timestamp, args.output) + + last_process = time.time() + + except KeyboardInterrupt: + print(f"\n{'='*60}\n🛑 Stopping...") + + capturer.close() + if args.output and os.path.exists(args.output): + print(f"\n💾 Transcript saved: {os.path.abspath(args.output)}") + print("\n✅ Done!") + + +if __name__ == "__main__": + main() diff --git a/transcribe_dual_linux.py b/transcribe_dual_linux.py index 039a16c..704ca78 100755 --- a/transcribe_dual_linux.py +++ b/transcribe_dual_linux.py @@ -1,7 +1,7 @@ #!/usr/bin/env python3 """ -Real-time transcription with dual audio capture (microphone + speaker monitor). -Linux/PipeWire optimized with Ollama LLM fact-checking. +Real-time audio transcription with dual capture and optional LLM analysis. +Supports microphone + speaker monitor, file output, and fact-checking. """ import sounddevice as sd @@ -9,6 +9,7 @@ import numpy as np import threading import queue import time +import os import argparse from datetime import datetime from faster_whisper import WhisperModel @@ -197,8 +198,8 @@ class WhisperTranscriber: return None -class LLMFactChecker: - """Fast fact-checking with Ollama""" +class LLMAnalyzer: + """LLM analysis with fact-checking and question generation""" def __init__(self, model="qwen2.5:3b"): if not OLLAMA_AVAILABLE: @@ -228,34 +229,100 @@ Statement: "{text}" """ ) import re - text = response['response'] + response_text = response['response'] - verdict = re.search(r'VERDICT:\s*(\w+)', text, re.I) - confidence = re.search(r'CONFIDENCE:\s*([\d.]+)', text, re.I) - reason = re.search(r'REASON:\s*(.+?)(?:\n|$)', text, re.I | re.DOTALL) + verdict = re.search(r'VERDICT:\s*(\w+)', response_text, re.I) + confidence = re.search(r'CONFIDENCE:\s*([\d.]+)', response_text, re.I) + reason = re.search(r'REASON:\s*(.+?)(?:\n|$)', response_text, re.I | 
re.DOTALL) return { 'verdict': verdict.group(1).lower() if verdict else 'unknown', 'confidence': float(confidence.group(1)) if confidence else 0.5, - 'reason': reason.group(1).strip() if reason else text[:150] + 'reason': reason.group(1).strip() if reason else response_text[:150] } except Exception as e: return {'verdict': 'error', 'confidence': 0.0, 'reason': str(e)} + def generate_questions(self, text): + """Generate follow-up questions""" + prompt = f"""Generate 3 insightful questions about this. Reply ONLY with: +Q1: [question] +Q2: [question] +Q3: [question] + +Statement: "{text}" """ + + try: + response = ollama.generate( + model=self.model, + prompt=prompt, + options={"temperature": 0.7, "num_predict": 120} + ) + + import re + response_text = response['response'] + questions = [] + + for i in range(1, 4): + q_match = re.search(rf'Q{i}:\s*(.+?)(?:\n|$)', response_text, re.I) + if q_match: + question = q_match.group(1).strip() + if not question.endswith('?'): + question += '?' + questions.append(question) + + # Fallback defaults + while len(questions) < 3: + defaults = ["What are the implications?", "What evidence supports this?", "What's the context?"] + questions.append(defaults[len(questions)]) + + return questions[:3] + except Exception as e: + return ["What are the key points?", "What supports this?", "What are the implications?"] + + +def save_transcript(text, source, timestamp, filename): + """Append transcript to file""" + os.makedirs(os.path.dirname(filename) if os.path.dirname(filename) else '.', exist_ok=True) + with open(filename, "a", encoding="utf-8") as f: + source_label = "MIC" if source == 'mic' else "SPEAKER" + f.write(f"[{timestamp}] {source_label}: {text}\n") + + +def save_enriched_transcript(text, source, timestamp, fact_check, questions, filename): + """Save enriched transcript with LLM analysis""" + os.makedirs(os.path.dirname(filename) if os.path.dirname(filename) else '.', exist_ok=True) + with open(filename, "a", encoding="utf-8") as 
f: + source_label = "MIC" if source == 'mic' else "SPEAKER" + f.write(f"\n{'='*70}\n") + f.write(f"[{timestamp}] {source_label}: {text}\n\n") + + if fact_check: + f.write(f"📊 Fact Check: {fact_check['verdict'].upper()} ") + f.write(f"(confidence: {fact_check['confidence']:.2f})\n") + f.write(f"💡 {fact_check['reason']}\n\n") + + if questions: + f.write("❓ Questions:\n") + for i, q in enumerate(questions, 1): + f.write(f"{i}. {q}\n") + f.write("\n") + def main(): - parser = argparse.ArgumentParser(description="Dual audio transcription with fact-checking") - parser.add_argument("--model", default="tiny", choices=["tiny", "base", "small", "medium"], - help="Whisper model (default: tiny for speed)") - parser.add_argument("--language", default="en", help="Language code") + parser = argparse.ArgumentParser(description="Real-time audio transcription with dual capture") + parser.add_argument("--model", default="tiny", choices=["tiny", "base", "small", "medium", "large"], + help="Whisper model (default: tiny)") + parser.add_argument("--language", default="en", help="Language code (default: en)") parser.add_argument("--mic", help="Microphone device name (partial match)") parser.add_argument("--monitor", help="Monitor device name for speaker capture") - parser.add_argument("--interval", type=float, default=5.0, help="Processing interval (seconds)") - parser.add_argument("--min-duration", type=float, default=2.0, help="Min audio duration") - parser.add_argument("--enable-llm", action="store_true", help="Enable fact-checking") - parser.add_argument("--llm-model", default="qwen2.5:3b", help="Ollama model") - parser.add_argument("--list-devices", action="store_true", help="List audio devices") - parser.add_argument("--force-cpu", action="store_true", help="Force CPU") + parser.add_argument("--interval", type=float, default=5.0, help="Processing interval in seconds (default: 5.0)") + parser.add_argument("--min-duration", type=float, default=2.0, help="Minimum audio duration 
(default: 2.0)") + parser.add_argument("--enable-llm", action="store_true", help="Enable LLM analysis (fact-checking + questions)") + parser.add_argument("--llm-model", default="qwen2.5:3b", help="Ollama model (default: qwen2.5:3b)") + parser.add_argument("--output", "-o", help="Save transcript to file") + parser.add_argument("--list-devices", action="store_true", help="List audio devices and exit") + parser.add_argument("--force-cpu", action="store_true", help="Force CPU processing") args = parser.parse_args() @@ -268,8 +335,12 @@ def main(): print(f" [{i:2d}] {dev['name']:<50} IN:{in_ch} OUT:{out_ch}") return - print("=== Dual Audio Transcription with Fact-Checking ===") + print("=== Real-Time Audio Transcription ===") print(f"Model: {args.model} | Language: {args.language} | Interval: {args.interval}s") + if args.output: + print(f"Output: {args.output}") + if args.enable_llm: + print(f"LLM Analysis: Enabled ({args.llm_model})") # Initialize capture try: @@ -296,14 +367,14 @@ def main(): print(f"\n❌ Whisper Error: {e}") return - # Initialize fact checker - fact_checker = None + # Initialize LLM analyzer + llm_analyzer = None if args.enable_llm: try: - fact_checker = LLMFactChecker(model=args.llm_model) + llm_analyzer = LLMAnalyzer(model=args.llm_model) except Exception as e: print(f"\n⚠ LLM Error: {e}") - print("Continuing without fact-checking...") + print("Continuing without LLM analysis...") # Main loop print(f"\n✅ Started. 
Press Ctrl+C to stop.\n{'='*60}") @@ -329,10 +400,27 @@ def main(): source_emoji = "🎤" if source == 'mic' else "🔊" print(f"\n{source_emoji} [{timestamp}] {text}") - if fact_checker: - fc = fact_checker.fact_check(text) - verdict_emoji = {'factual': '✅', 'dubious': '⚠️', 'false': '❌'}.get(fc['verdict'], '❓') - print(f" {verdict_emoji} {fc['verdict'].upper()} ({fc['confidence']:.2f}): {fc['reason']}") + # LLM analysis + fact_check = None + questions = None + if llm_analyzer: + fact_check = llm_analyzer.fact_check(text) + questions = llm_analyzer.generate_questions(text) + + verdict_emoji = {'factual': '✅', 'dubious': '⚠️', 'false': '❌'}.get( + fact_check['verdict'], '❓') + print(f" {verdict_emoji} {fact_check['verdict'].upper()} " + f"({fact_check['confidence']:.2f}): {fact_check['reason']}") + print(f" ❓ Questions:") + for i, q in enumerate(questions, 1): + print(f" {i}. {q}") + + # Save to file + if args.output: + if llm_analyzer: + save_enriched_transcript(text, source, timestamp, fact_check, questions, args.output) + else: + save_transcript(text, source, timestamp, args.output) last_process = time.time() @@ -340,6 +428,8 @@ def main(): print(f"\n{'='*60}\n🛑 Stopping...") capturer.close() + if args.output and os.path.exists(args.output): + print(f"\n💾 Transcript saved: {os.path.abspath(args.output)}") print("\n✅ Done!") diff --git a/transcribe_duil_linux_old.py b/transcribe_duil_linux_old.py new file mode 100644 index 0000000..039a16c --- /dev/null +++ b/transcribe_duil_linux_old.py @@ -0,0 +1,347 @@ +#!/usr/bin/env python3 +""" +Real-time transcription with dual audio capture (microphone + speaker monitor). +Linux/PipeWire optimized with Ollama LLM fact-checking. 
+""" + +import sounddevice as sd +import numpy as np +import threading +import queue +import time +import argparse +from datetime import datetime +from faster_whisper import WhisperModel + +try: + import ollama + OLLAMA_AVAILABLE = True +except ImportError: + OLLAMA_AVAILABLE = False + + +class DualAudioCapture: + """Capture both microphone and speaker output simultaneously""" + + def __init__(self, mic_device=None, monitor_device=None, sample_rate=16000, chunk_size=2048): + self.sample_rate = sample_rate + self.chunk_size = chunk_size + self.audio_queue = queue.Queue() + + # Find devices + devices = sd.query_devices() + + # Microphone (default input or specified) + if mic_device is None: + self.mic_device = sd.default.device[0] # Default input + else: + self.mic_device = self._find_device(mic_device, input_required=True) + + # Monitor/Loopback (for speaker output) + if monitor_device: + self.monitor_device = self._find_device(monitor_device, input_required=True) + else: + self.monitor_device = None + + print(f"✓ Microphone: {devices[self.mic_device]['name']} (index {self.mic_device})") + if self.monitor_device: + print(f"✓ Monitor: {devices[self.monitor_device]['name']} (index {self.monitor_device})") + else: + print("⚠ No monitor device - capturing microphone only") + + # Start streams + self.mic_stream = sd.InputStream( + device=self.mic_device, + channels=1, + samplerate=sample_rate, + blocksize=chunk_size, + dtype='int16', + callback=self._mic_callback + ) + + if self.monitor_device: + self.monitor_stream = sd.InputStream( + device=self.monitor_device, + channels=1, + samplerate=sample_rate, + blocksize=chunk_size, + dtype='int16', + callback=self._monitor_callback + ) + else: + self.monitor_stream = None + + self.mic_stream.start() + if self.monitor_stream: + self.monitor_stream.start() + + print("✓ Audio capture started") + + def _find_device(self, device_name, input_required=True): + """Find device by name substring""" + devices = sd.query_devices() + for 
i, dev in enumerate(devices): + if device_name.lower() in dev['name'].lower(): + if not input_required or dev['max_input_channels'] > 0: + return i + raise RuntimeError(f"Device '{device_name}' not found") + + def _mic_callback(self, indata, frames, time_info, status): + """Microphone audio callback""" + if status: + print(f"⚠ Mic status: {status}") + self.audio_queue.put(('mic', indata.copy())) + + def _monitor_callback(self, indata, frames, time_info, status): + """Monitor/speaker audio callback""" + if status: + print(f"⚠ Monitor status: {status}") + self.audio_queue.put(('monitor', indata.copy())) + + def read_chunk(self): + """Read audio data from queue""" + try: + return self.audio_queue.get(timeout=0.05) + except queue.Empty: + return None + + def close(self): + """Cleanup resources""" + self.mic_stream.stop() + self.mic_stream.close() + if self.monitor_stream: + self.monitor_stream.stop() + self.monitor_stream.close() + + +class WhisperTranscriber: + """Process audio with Whisper""" + + def __init__(self, model_name="base", language="en", force_cpu=False): + print(f"Loading Whisper model '{model_name}'...") + + import torch + has_cuda = torch.cuda.is_available() and not force_cpu + + device = "cpu" + compute_type = "int8" + + if has_cuda: + try: + import ctranslate2 + if ctranslate2.get_cuda_device_count() > 0: + device = "cuda" + compute_type = "float16" + print(f"✓ Using GPU: {torch.cuda.get_device_name(0)}") + except Exception as e: + print(f"⚠ CUDA unavailable: {e}") + + if device == "cpu": + print("✓ Using CPU") + + model_kwargs = {"device": device, "compute_type": compute_type} + if device == "cpu": + model_kwargs["cpu_threads"] = 4 + + self.model = WhisperModel(model_name, **model_kwargs) + self.language = language + self.mic_buffer = np.array([], dtype=np.float32) + self.monitor_buffer = np.array([], dtype=np.float32) + self.lock = threading.Lock() + + def add_audio(self, source, audio_chunk): + """Add audio to appropriate buffer""" + with 
self.lock: + audio_float = audio_chunk.flatten().astype(np.float32) / 32768.0 + if source == 'mic': + self.mic_buffer = np.concatenate([self.mic_buffer, audio_float]) + else: + self.monitor_buffer = np.concatenate([self.monitor_buffer, audio_float]) + + def transcribe_chunk(self, min_duration=3.0): + """Transcribe accumulated audio""" + with self.lock: + mic_duration = len(self.mic_buffer) / 16000 + monitor_duration = len(self.monitor_buffer) / 16000 + + results = {} + + # Transcribe microphone + if mic_duration >= min_duration: + mic_audio = self.mic_buffer.copy() + self.mic_buffer = np.array([], dtype=np.float32) + results['mic'] = self._transcribe(mic_audio) + + # Transcribe monitor + if monitor_duration >= min_duration: + monitor_audio = self.monitor_buffer.copy() + self.monitor_buffer = np.array([], dtype=np.float32) + results['monitor'] = self._transcribe(monitor_audio) + + return results if results else None + + def _transcribe(self, audio): + """Internal transcription""" + try: + segments, _ = self.model.transcribe( + audio, + language=self.language, + beam_size=3, # Faster than default 5 + vad_filter=True, + vad_parameters=dict(min_silence_duration_ms=500) + ) + text = " ".join([seg.text for seg in segments]).strip() + return text if text else None + except Exception as e: + print(f"❌ Transcription error: {e}") + return None + + +class LLMFactChecker: + """Fast fact-checking with Ollama""" + + def __init__(self, model="qwen2.5:3b"): + if not OLLAMA_AVAILABLE: + raise RuntimeError("Ollama not installed: pip install ollama") + + self.model = model + try: + ollama.list() + print(f"✓ Ollama connected: {self.model}") + except Exception as e: + raise RuntimeError(f"Ollama not running: {e}") + + def fact_check(self, text): + """Quick fact-check""" + prompt = f"""Fact-check this statement. 
Reply ONLY with: +VERDICT: factual/dubious/false +CONFIDENCE: 0.0-1.0 +REASON: one sentence + +Statement: "{text}" """ + + try: + response = ollama.generate( + model=self.model, + prompt=prompt, + options={"temperature": 0.1, "num_predict": 80} + ) + + import re + text = response['response'] + + verdict = re.search(r'VERDICT:\s*(\w+)', text, re.I) + confidence = re.search(r'CONFIDENCE:\s*([\d.]+)', text, re.I) + reason = re.search(r'REASON:\s*(.+?)(?:\n|$)', text, re.I | re.DOTALL) + + return { + 'verdict': verdict.group(1).lower() if verdict else 'unknown', + 'confidence': float(confidence.group(1)) if confidence else 0.5, + 'reason': reason.group(1).strip() if reason else text[:150] + } + except Exception as e: + return {'verdict': 'error', 'confidence': 0.0, 'reason': str(e)} + + +def main(): + parser = argparse.ArgumentParser(description="Dual audio transcription with fact-checking") + parser.add_argument("--model", default="tiny", choices=["tiny", "base", "small", "medium"], + help="Whisper model (default: tiny for speed)") + parser.add_argument("--language", default="en", help="Language code") + parser.add_argument("--mic", help="Microphone device name (partial match)") + parser.add_argument("--monitor", help="Monitor device name for speaker capture") + parser.add_argument("--interval", type=float, default=5.0, help="Processing interval (seconds)") + parser.add_argument("--min-duration", type=float, default=2.0, help="Min audio duration") + parser.add_argument("--enable-llm", action="store_true", help="Enable fact-checking") + parser.add_argument("--llm-model", default="qwen2.5:3b", help="Ollama model") + parser.add_argument("--list-devices", action="store_true", help="List audio devices") + parser.add_argument("--force-cpu", action="store_true", help="Force CPU") + + args = parser.parse_args() + + if args.list_devices: + print("\nAvailable audio devices:") + for i, dev in enumerate(sd.query_devices()): + in_ch = dev['max_input_channels'] + out_ch = 
dev['max_output_channels'] + if in_ch > 0: + print(f" [{i:2d}] {dev['name']:<50} IN:{in_ch} OUT:{out_ch}") + return + + print("=== Dual Audio Transcription with Fact-Checking ===") + print(f"Model: {args.model} | Language: {args.language} | Interval: {args.interval}s") + + # Initialize capture + try: + capturer = DualAudioCapture( + mic_device=args.mic, + monitor_device=args.monitor, + sample_rate=16000, + chunk_size=2048 + ) + except Exception as e: + print(f"\n❌ Audio Error: {e}") + print("\nTip: Use --list-devices to see available devices") + print(" Use --mic and --monitor to specify devices") + return + + # Initialize transcriber + try: + transcriber = WhisperTranscriber( + model_name=args.model, + language=args.language, + force_cpu=args.force_cpu + ) + except Exception as e: + print(f"\n❌ Whisper Error: {e}") + return + + # Initialize fact checker + fact_checker = None + if args.enable_llm: + try: + fact_checker = LLMFactChecker(model=args.llm_model) + except Exception as e: + print(f"\n⚠ LLM Error: {e}") + print("Continuing without fact-checking...") + + # Main loop + print(f"\n✅ Started. 
Press Ctrl+C to stop.\n{'='*60}") + last_process = time.time() + + try: + while True: + # Collect audio + chunk = capturer.read_chunk() + if chunk: + source, audio = chunk + transcriber.add_audio(source, audio) + + # Process at intervals + if time.time() - last_process >= args.interval: + results = transcriber.transcribe_chunk(min_duration=args.min_duration) + + if results: + timestamp = datetime.now().strftime("%H:%M:%S") + + for source, text in results.items(): + if text: + source_emoji = "🎤" if source == 'mic' else "🔊" + print(f"\n{source_emoji} [{timestamp}] {text}") + + if fact_checker: + fc = fact_checker.fact_check(text) + verdict_emoji = {'factual': '✅', 'dubious': '⚠️', 'false': '❌'}.get(fc['verdict'], '❓') + print(f" {verdict_emoji} {fc['verdict'].upper()} ({fc['confidence']:.2f}): {fc['reason']}") + + last_process = time.time() + + except KeyboardInterrupt: + print(f"\n{'='*60}\n🛑 Stopping...") + + capturer.close() + print("\n✅ Done!") + + +if __name__ == "__main__": + main()
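
The reply parsing done inside `fact_check` and `generate_questions` in the patch above can be pulled out into pure functions, which makes it testable without a running Ollama server. The following is a minimal sketch, assuming the model replies in the `VERDICT:/CONFIDENCE:/REASON:` and `Q1:/Q2:/Q3:` formats requested by the prompts. The names `parse_fact_check`, `parse_questions`, and `DEFAULT_QUESTIONS` are hypothetical and do not appear in the patch. The `ValueError` fallback for a malformed confidence value is also an addition; the patch instead lets the outer `except` return an `'error'` verdict.

```python
import re

# Hypothetical constant: mirrors the fallback questions used in the patch.
DEFAULT_QUESTIONS = [
    "What are the implications?",
    "What evidence supports this?",
    "What's the context?",
]


def parse_fact_check(response_text):
    """Extract verdict, confidence and reason from a model reply (sketch)."""
    verdict = re.search(r'VERDICT:\s*(\w+)', response_text, re.I)
    confidence = re.search(r'CONFIDENCE:\s*([\d.]+)', response_text, re.I)
    reason = re.search(r'REASON:\s*(.+?)(?:\n|$)', response_text, re.I)

    # Assumption: fall back to 0.5 when the number is malformed (e.g. "0.9.").
    try:
        conf_value = float(confidence.group(1)) if confidence else 0.5
    except ValueError:
        conf_value = 0.5

    return {
        'verdict': verdict.group(1).lower() if verdict else 'unknown',
        'confidence': conf_value,
        'reason': reason.group(1).strip() if reason else response_text[:150],
    }


def parse_questions(response_text, count=3):
    """Collect Q1..Qn lines, append '?' where missing, pad with defaults."""
    questions = []
    for i in range(1, count + 1):
        match = re.search(rf'Q{i}:\s*(.+?)(?:\n|$)', response_text, re.I)
        if match:
            question = match.group(1).strip()
            if not question.endswith('?'):
                question += '?'
            questions.append(question)

    # Pad with defaults, indexed by how many questions were parsed so far.
    while len(questions) < count:
        questions.append(DEFAULT_QUESTIONS[len(questions) % len(DEFAULT_QUESTIONS)])

    return questions[:count]
```

Keeping the parsing separate from the `ollama.generate` call also avoids the variable shadowing that the patch fixes by renaming `text` to `response_text`, since the original statement and the model reply never share a scope.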