chore: update 6 file(s)
# Verbatim Dicta

Real-time audio transcription using Whisper AI with optional LLM analysis. Captures microphone and speaker audio simultaneously, so both sides of a conversation are transcribed.

## Features

- **Dual audio capture** - Record microphone and speaker output simultaneously
- **Real-time transcription** - Process audio with Whisper models as it is captured
- **LLM analysis** - Optional fact-checking and question generation via Ollama
- **Multi-language** - Support for 50+ languages
- **File output** - Save transcripts with timestamps and analysis
- **GPU acceleration** - CUDA support for faster processing

## Quick Start

```bash
# Install dependencies
pip install -r requirements.txt

# List audio devices
./run_transcribe.sh --list-devices

# Basic transcription
./run_transcribe.sh --model medium --language en

# With LLM analysis and file output
./run_transcribe.sh --model medium --enable-llm --output transcript.txt
```

## Requirements

For CUDA 12.1:

```bash
pip install torch==2.8.0+cu121 --index-url https://download.pytorch.org/whl/cu121
```

### 3. Audio Setup

**Linux (PulseAudio/PipeWire):**
```bash
# List devices to find your monitor device
./run_transcribe.sh --list-devices

# Pass the monitor device to use
./run_transcribe.sh --monitor "alsa_output.monitor"
```

**Windows:**
- Enable "Stereo Mix" in Sound settings, or
- Install VB-Cable from [vb-audio.com](https://vb-audio.com/Cable/)
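
On Linux, the `--monitor` value names a PulseAudio/PipeWire monitor source, which carries whatever the speakers are playing. The sketch below shows one way to list candidate names by parsing `pactl list short sources`. It assumes `pactl` is installed (PipeWire systems provide it through pipewire-pulse), and the helper names are hypothetical, not part of this project.

```python
import subprocess


def find_monitor_sources(pactl_output: str) -> list[str]:
    """Return source names ending in '.monitor' from `pactl list short sources` output."""
    monitors = []
    for line in pactl_output.splitlines():
        # Columns are tab-separated: index, name, driver, sample spec, state
        fields = line.split("\t")
        if len(fields) >= 2 and fields[1].endswith(".monitor"):
            monitors.append(fields[1])
    return monitors


def list_monitor_sources() -> list[str]:
    """Run pactl and return the monitor source names it reports."""
    result = subprocess.run(
        ["pactl", "list", "short", "sources"],
        capture_output=True,
        text=True,
        check=True,
    )
    return find_monitor_sources(result.stdout)
```

Monitor names typically look like `alsa_output.pci-0000_00_1f.3.analog-stereo.monitor`; the short `"alsa_output.monitor"` in the command above is a placeholder for such a name.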

### 4. LLM Support (Optional)

```bash
# Install Ollama from ollama.ai, then pull the default analysis model
ollama pull qwen2.5:3b
```
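
Analysis runs against a local Ollama server. As a rough sketch of what such a call can look like, the code below builds a non-streaming request for Ollama's `/api/generate` endpoint and decodes a JSON reply. The prompt text and the `verdict`/`confidence`/`explanation`/`questions` keys are assumptions for illustration, not the prompt this project uses.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_analysis_request(text: str, model: str = "qwen2.5:3b") -> bytes:
    """Build a non-streaming /api/generate request body that asks for JSON output."""
    prompt = (
        "Fact-check the following statement and suggest up to three follow-up "
        'questions. Reply as JSON with keys "verdict", "confidence", '
        '"explanation" and "questions".\n\n'
        f"Statement: {text}"
    )
    payload = {"model": model, "prompt": prompt, "stream": False, "format": "json"}
    return json.dumps(payload).encode("utf-8")


def analyze(text: str, model: str = "qwen2.5:3b", timeout: float = 60.0) -> dict:
    """POST the request to a running Ollama server and decode the model's JSON reply."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_analysis_request(text, model),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    # With "stream": False, the generated text arrives in the "response" field
    return json.loads(body["response"])
```

`analyze(...)` only succeeds while `ollama serve` is running and the model has been pulled.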

## Usage

### Command Line Options

```bash
python transcribe.py [OPTIONS]

Options:
  --model {tiny,base,small,medium,large}  Whisper model (default: tiny)
  --language CODE                         Language code (default: en)
  --mic DEVICE                            Microphone device name
  --monitor DEVICE                        Speaker monitor device name
  --interval SECONDS                      Processing interval (default: 5.0)
  --min-duration SECONDS                  Minimum audio duration (default: 2.0)
  --enable-llm                            Enable LLM analysis
  --llm-model MODEL                       Ollama model (default: qwen2.5:3b)
  --output FILE                           Save transcript to file
  --force-cpu                             Force CPU processing
  --list-devices                          List audio devices
```
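
For readers wiring these flags into their own tooling, here is a hedged `argparse` reconstruction of the documented options and defaults. It is derived only from the option list above and may differ from the actual `transcribe.py`, for example in help text or additional flags.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the documented options and defaults (a reconstruction)."""
    parser = argparse.ArgumentParser(description="Real-time mic + speaker transcription")
    parser.add_argument("--model", choices=["tiny", "base", "small", "medium", "large"],
                        default="tiny", help="Whisper model size")
    parser.add_argument("--language", default="en", help="Language code, e.g. en or nl")
    parser.add_argument("--mic", help="Microphone device name")
    parser.add_argument("--monitor", help="Speaker monitor device name")
    parser.add_argument("--interval", type=float, default=5.0,
                        help="Processing interval in seconds")
    parser.add_argument("--min-duration", type=float, default=2.0,
                        help="Minimum buffered audio (seconds) before transcribing")
    parser.add_argument("--enable-llm", action="store_true", help="Enable LLM analysis")
    parser.add_argument("--llm-model", default="qwen2.5:3b", help="Ollama model")
    parser.add_argument("--output", metavar="FILE", help="Save transcript to file")
    parser.add_argument("--force-cpu", action="store_true", help="Force CPU processing")
    parser.add_argument("--list-devices", action="store_true", help="List audio devices")
    return parser
```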

### Examples

```bash
# Dutch transcription with LLM
./run_transcribe.sh --model medium --language nl --enable-llm

# High-quality meeting transcription
./run_transcribe.sh --model large --interval 8 --output meeting.txt

# Fast real-time transcription
./run_transcribe.sh --model tiny --interval 3 --min-duration 2

# Specific devices
./run_transcribe.sh --mic "USB Mic" --monitor "Monitor of Speakers"
```
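
`--mic` and `--monitor` take device names such as `"USB Mic"`. Assuming names are matched as case-insensitive substrings of what `--list-devices` prints, a lookup over the dictionaries returned by `sounddevice.query_devices()` might look like this. `find_input_device` is a hypothetical helper; sounddevice itself also accepts a device-name substring for its `device` argument.

```python
from typing import Iterable, Mapping, Optional


def find_input_device(devices: Iterable[Mapping], name_fragment: str) -> Optional[int]:
    """Return the index of the first input-capable device whose name contains
    name_fragment (case-insensitive), or None if nothing matches."""
    needle = name_fragment.lower()
    for index, dev in enumerate(devices):
        if dev["max_input_channels"] > 0 and needle in dev["name"].lower():
            return index
    return None


# With sounddevice installed, the device list comes from sd.query_devices():
#   import sounddevice as sd
#   mic_index = find_input_device(sd.query_devices(), "USB Mic")
#   monitor_index = find_input_device(sd.query_devices(), "Monitor of Speakers")
```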

## Model Performance

| Model  | Size    | Speed    | Quality | Use Case               |
|--------|---------|----------|---------|------------------------|
| tiny   | ~75 MB  | Fastest  | Basic   | Real-time, low latency |
| base   | ~145 MB | Fast     | Good    | General use            |
| small  | ~485 MB | Moderate | Better  | Balanced               |
| medium | ~1.5 GB | Slow     | Great   | High accuracy          |
| large  | ~3 GB   | Slowest  | Best    | Maximum quality        |
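
The Speed column matters because live transcription only keeps up if each chunk is processed faster than it lasts. One common measure is the real-time factor: processing time divided by audio duration, where values below 1.0 keep pace. For example, with `--interval 5`, a real-time factor of 0.5 means each 5-second chunk takes about 2.5 seconds to transcribe. A minimal timing sketch, where `transcribe_fn` is a hypothetical stand-in for whatever callable runs the model:

```python
import time
from typing import Callable, Sequence


def real_time_factor(transcribe_fn: Callable[[Sequence[float]], object],
                     samples: Sequence[float], sample_rate: int = 16_000) -> float:
    """Processing time divided by audio duration; below 1.0 means the model keeps up."""
    duration = len(samples) / sample_rate
    if duration == 0:
        raise ValueError("empty audio chunk")
    start = time.perf_counter()
    transcribe_fn(samples)
    return (time.perf_counter() - start) / duration
```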

## Troubleshooting

**No audio devices found:**
```bash
# List all devices
./run_transcribe.sh --list-devices

# Specify devices explicitly
./run_transcribe.sh --mic "device_name" --monitor "monitor_name"
```

**CUDA errors:**
```bash
# Force CPU processing
./run_transcribe.sh --force-cpu
```
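
`--force-cpu` moves inference off the GPU. Since the transcription backend is faster-whisper, a sketch of that decision might map onto `WhisperModel`'s `device` and `compute_type` arguments as below. The `choose_backend` helper and the specific compute types (`float16` on CUDA, `int8` on CPU, both common faster-whisper choices) are assumptions about this project, not its confirmed behavior.

```python
from typing import Tuple


def choose_backend(force_cpu: bool, cuda_device_count: int) -> Tuple[str, str]:
    """Return (device, compute_type) for faster-whisper's WhisperModel."""
    if force_cpu or cuda_device_count == 0:
        return "cpu", "int8"  # int8 quantization keeps CPU inference usable
    return "cuda", "float16"  # half precision on NVIDIA GPUs


# Typical wiring (requires faster-whisper, which depends on ctranslate2):
#   import ctranslate2
#   from faster_whisper import WhisperModel
#   device, compute_type = choose_backend(args.force_cpu, ctranslate2.get_cuda_device_count())
#   model = WhisperModel("base", device=device, compute_type=compute_type)
```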

**Ollama connection failed:**
```bash
# Start Ollama service
ollama serve

# Pull required model
ollama pull qwen2.5:3b
```
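
Both failure modes listed here (server not running, model not pulled) can be checked programmatically. This sketch queries Ollama's `GET /api/tags` endpoint, which lists locally installed models. The helper names are hypothetical, and `http://localhost:11434` is Ollama's default address unless `OLLAMA_HOST` changes it.

```python
import json
import urllib.request
from typing import List, Optional

OLLAMA_BASE_URL = "http://localhost:11434"  # Ollama's default listen address


def installed_models(base_url: str = OLLAMA_BASE_URL,
                     timeout: float = 3.0) -> Optional[List[str]]:
    """List model names via GET /api/tags, or return None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            data = json.load(resp)
    except OSError:  # URLError, connection refused and timeouts are all OSError subclasses
        return None
    return [m["name"] for m in data.get("models", [])]


def has_model(names: List[str], wanted: str) -> bool:
    """True if `wanted` is installed; an untagged name like 'llama3.2' means 'llama3.2:latest'."""
    if ":" not in wanted:
        wanted += ":latest"
    return wanted in names
```

A `None` result points to `ollama serve` not running; `has_model(names, "qwen2.5:3b")` returning `False` means the model still needs to be pulled.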

**Poor transcription quality:**
- Use a larger model: `--model medium` or `--model large`
- Increase the interval: `--interval 10`
- Specify the language explicitly, e.g. `--language nl`
- Ensure good audio quality (reduce background noise)

## Output Format

### Standard Output
```
🎤 [14:23:15] User speaking into microphone
🔊 [14:23:18] Audio from speakers or system
```

### With LLM Analysis
```
🎤 [14:23:15] The Earth orbits the Sun in 365 days.
✅ FACTUAL (0.98): Scientifically accurate orbital period.
❓ Questions:
1. Why do we need leap years?
2. How does the elliptical orbit affect seasons?
3. What factors influence Earth's orbital velocity?
```
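
A hedged sketch of how a parsed analysis could be rendered in the console format shown above. It assumes the model's reply has already been decoded into a dict with `verdict`, `confidence`, `explanation` and `questions` keys (hypothetical names). The ⚠️ icon for non-FACTUAL verdicts is also an assumption; only the ✅ FACTUAL form appears in the sample.

```python
from typing import Any, Dict, List


def format_analysis(analysis: Dict[str, Any]) -> List[str]:
    """Render a decoded analysis dict as console lines in the sample format."""
    verdict = str(analysis.get("verdict", "UNKNOWN")).upper()
    icon = "✅" if verdict == "FACTUAL" else "⚠️"  # non-FACTUAL icon is an assumption
    confidence = float(analysis.get("confidence", 0.0))
    lines = [f"{icon} {verdict} ({confidence:.2f}): {analysis.get('explanation', '')}"]
    questions = analysis.get("questions") or []
    if questions:
        lines.append("❓ Questions:")
        lines.extend(f"{i}. {q}" for i, q in enumerate(questions, start=1))
    return lines
```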

### File Output
```
[14:23:15] MIC: User speaking into microphone
[14:23:18] SPEAKER: Audio from speakers

======================================================================
[14:23:25] MIC: The Earth orbits the Sun in 365 days.

📊 Fact Check: FACTUAL (confidence: 0.98)
💡 Scientifically accurate orbital period.

❓ Questions:
1. Why do we need leap years?
2. How does the elliptical orbit affect seasons?
3. What factors influence Earth's orbital velocity?
```
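
The console and file formats differ only in how the source is shown: an icon on the console versus a `MIC:`/`SPEAKER:` label in the file. A minimal sketch of both formatters plus an append-only writer for `--output`; the function names are hypothetical, and the separator and analysis lines are left out.

```python
from datetime import datetime
from typing import Optional

# Console icons per source, as in the sample output
SOURCE_ICONS = {"MIC": "🎤", "SPEAKER": "🔊"}


def console_line(source: str, text: str, when: datetime) -> str:
    """Format a console line, e.g. '🎤 [14:23:15] ...'."""
    return f"{SOURCE_ICONS[source]} [{when:%H:%M:%S}] {text}"


def file_line(source: str, text: str, when: datetime) -> str:
    """Format a transcript-file line, e.g. '[14:23:15] MIC: ...'."""
    return f"[{when:%H:%M:%S}] {source}: {text}"


def append_to_transcript(path: str, source: str, text: str,
                         when: Optional[datetime] = None) -> None:
    """Append one line to the transcript file; the file is closed after each write."""
    when = when or datetime.now()
    with open(path, "a", encoding="utf-8") as f:
        f.write(file_line(source, text, when) + "\n")
```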

## Architecture

- **Audio Capture**: sounddevice with dual-stream support
- **Transcription**: faster-whisper (optimized Whisper implementation)
- **LLM**: Ollama for local inference
- **Format**: 16kHz mono, 16-bit PCM
- **Processing**: Independent mic/speaker buffers; Whisper decoding with `beam_size=3`
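
The Processing line describes independent mic and speaker buffers. The sketch below models that under stated assumptions: each source gets its own buffer, which is handed to the transcriber once per `--interval` only if at least `--min-duration` seconds of audio have accumulated, and raw 16-bit PCM is scaled to floats in [-1.0, 1.0) as Whisper models expect. The class and function names are hypothetical, and the real scripts may schedule processing differently.

```python
import sys
from array import array
from dataclasses import dataclass, field
from typing import List, Optional

SAMPLE_RATE = 16_000  # 16 kHz mono, per the format listed above


def pcm16_to_float(raw: bytes) -> List[float]:
    """Convert little-endian signed 16-bit PCM bytes to floats in [-1.0, 1.0)."""
    samples = array("h")
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # array("h") uses native byte order
    return [s / 32768.0 for s in samples]


@dataclass
class SourceBuffer:
    """Accumulates audio for one source (mic or speaker), independent of the other."""

    name: str
    interval: float = 5.0      # seconds between processing passes (--interval)
    min_duration: float = 2.0  # minimum buffered audio to transcribe (--min-duration)
    sample_rate: int = SAMPLE_RATE
    _samples: List[float] = field(default_factory=list, init=False, repr=False)
    _last_processed: float = field(default=0.0, init=False, repr=False)

    def add(self, chunk: List[float]) -> None:
        self._samples.extend(chunk)

    def duration(self) -> float:
        return len(self._samples) / self.sample_rate

    def poll(self, now: float) -> Optional[List[float]]:
        """Hand back the buffered audio once per interval, if enough has accumulated."""
        if now - self._last_processed < self.interval:
            return None
        if self.duration() < self.min_duration:
            return None  # keep accumulating rather than transcribe a fragment
        chunk, self._samples = self._samples, []
        self._last_processed = now
        return chunk


# One buffer per stream, polled from the main loop, e.g.:
#   buffers = [SourceBuffer("MIC"), SourceBuffer("SPEAKER")]
#   for buf in buffers:
#       audio = buf.poll(time.monotonic())
#       if audio is not None:
#           segments, _ = model.transcribe(numpy.asarray(audio, dtype=numpy.float32),
#                                          beam_size=3)
```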

## Contributing

Contributions welcome! Please open issues or submit pull requests.

## License