QuentinFuxa

    QuentinFuxa/WhisperLiveKit

    #167 this week

    Simultaneous speech-to-text models

    backend
    Python
    Apache-2.0
    10.2K stars
    1.0K forks
    10.2K watching
    Updated 5/4/2026
    View on GitHub

    Health score
    49
    Weekly growth
    +0
    +0.0% this week
    Contributors
    35
    Open issues
    26
    Activity
    72
    Community
    50
    Maintenance
    35
    Last release 42d ago

    Use Cases & Benefits

    • WhisperLiveKit provides real-time, fully local speech-to-text transcription with speaker diarization via a Python FastAPI server and web interface.
    • It integrates state-of-the-art technologies like SimulStreaming, WhisperStreaming, Streaming Sortformer, Diart, and Silero VAD for low-latency transcription and speaker identification.
    • Strengths include ultra-low latency, multi-user support, voice activity detection, and flexible backend options; limitations involve dependency on FFmpeg and optional complex diarization setup.
    • With 1272 stars and 199 forks since late 2024, it shows strong community adoption and active maintenance for real-time speech transcription solutions.
    • Ideal for developers needing live transcription with speaker diarization in meetings, accessibility tools, podcasts, customer support, or any real-time audio processing application.

    About WhisperLiveKit

    WhisperLiveKit


    Real-time, Fully Local Speech-to-Text with Speaker Identification


    Real-time speech transcription directly to your browser, with a ready-to-use backend+server and a simple frontend. ✨

    Powered by Leading Research:

    • SimulStreaming (SOTA 2025) - Ultra-low latency transcription with AlignAtt policy
    • WhisperStreaming (SOTA 2023) - Low latency transcription with LocalAgreement policy
    • Streaming Sortformer (SOTA 2025) - Advanced real-time speaker diarization
    • Diart (SOTA 2021) - Real-time speaker diarization
    • Silero VAD (2024) - Enterprise-grade Voice Activity Detection

    Why not just run a simple Whisper model on every audio batch? Whisper is designed for complete utterances, not real-time chunks. Processing small segments loses context, cuts off words mid-syllable, and produces poor transcription. WhisperLiveKit uses state-of-the-art simultaneous speech research for intelligent buffering and incremental processing.
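
To make "incremental processing" concrete, here is a minimal sketch of the idea behind a LocalAgreement-style policy: a word is committed only once two consecutive hypotheses agree on it. This is a simplified illustration, not WhisperLiveKit's code. The class and function names are hypothetical, and it assumes each hypothesis is a word list covering the whole audio buffer from its start.

```python
from typing import List


def common_prefix(a: List[str], b: List[str]) -> List[str]:
    """Return the longest shared prefix of two word lists."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return prefix


class LocalAgreementBuffer:
    """Hypothetical helper: commit only words that two consecutive hypotheses share."""

    def __init__(self) -> None:
        self.committed: List[str] = []  # words already shown as final
        self.previous: List[str] = []   # unconfirmed tail of the last hypothesis

    def update(self, hypothesis: List[str]) -> List[str]:
        """Feed the latest hypothesis; return the newly committed words."""
        # Skip the part that is already committed (assumed stable).
        tail = hypothesis[len(self.committed):]
        agreed = common_prefix(self.previous, tail)
        self.committed.extend(agreed)
        # Keep the unconfirmed remainder for comparison with the next hypothesis.
        self.previous = tail[len(agreed):]
        return agreed
```

Feeding ["the", "cat"] and then ["the", "cat", "sat"] commits "the cat" on the second call, while "sat" waits for the next hypothesis to confirm it. A trailing word that the model later revises is never committed.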

    Architecture


    The backend supports multiple concurrent users. Voice Activity Detection reduces overhead when no voice is detected.
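
As a rough illustration of how gating on voice activity saves work, the sketch below drops quiet chunks before they would reach the transcription model. It swaps in a plain energy threshold for the neural Silero VAD the project uses, so treat it as a toy. The function names and the threshold value are assumptions, and the audio is assumed to be 16-bit signed PCM in native byte order.

```python
import array
import math


def rms(chunk: bytes) -> float:
    """Root-mean-square amplitude of 16-bit signed PCM audio (native byte order)."""
    samples = array.array("h", chunk)  # raises ValueError if the length is odd
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def has_voice(chunk: bytes, threshold: float = 500.0) -> bool:
    """Crude gate: True when the chunk is at least as loud as the threshold."""
    return rms(chunk) >= threshold


def voiced_chunks(chunks, threshold: float = 500.0):
    """Yield only the chunks that pass the gate; silent ones are skipped."""
    for chunk in chunks:
        if has_voice(chunk, threshold):
            yield chunk
```

A trained VAD model is far more robust than this threshold, since loud background noise would pass the gate and quiet speech would not.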

    Installation & Quick Start

    pip install whisperlivekit
    

    FFmpeg is required and must be installed before using WhisperLiveKit.

    OS            | How to install
    ------------- | --------------
    Ubuntu/Debian | sudo apt install ffmpeg
    macOS         | brew install ffmpeg
    Windows       | Download .exe from https://ffmpeg.org/download.html and add to PATH
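
Before starting the server, it can help to confirm that FFmpeg is actually reachable on PATH. The following is a minimal sketch using only the Python standard library. The helper names are hypothetical and are not part of WhisperLiveKit.

```python
import shutil
import subprocess
from typing import Optional


def find_ffmpeg() -> Optional[str]:
    """Return the full path of the ffmpeg executable, or None if it is not on PATH."""
    return shutil.which("ffmpeg")


def ffmpeg_version(path: str) -> str:
    """Return the first line printed by `ffmpeg -version`."""
    result = subprocess.run(
        [path, "-version"], capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()[0]


if __name__ == "__main__":
    path = find_ffmpeg()
    if path is None:
        print("FFmpeg not found on PATH; install it before starting the server.")
    else:
        print(ffmpeg_version(path))
```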

    Quick Start

    1. Start the transcription server:

      whisperlivekit-server --model base --language en
      
    2. Open your browser and navigate to http://localhost:8000. Start speaking and watch your words appear in real-time!

    • See tokenizer.py for the list of all available languages.
    • For HTTPS requirements, see the Parameters section for SSL configuration options.

    Optional Dependencies

    Optional                            | pip install
    ----------------------------------- | -----------
    Speaker diarization with Sortformer | git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[asr]
    Speaker diarization with Diart      | diart
    Original Whisper backend            | whisperlivekit[whisper]
    Improved timestamps backend         | whisperlivekit[whisper-timestamped]
    Apple Silicon optimization backend  | whisperlivekit[mlx-whisper]
    OpenAI API backend                  | whisperlivekit[openai]

    See Parameters & Configuration below on how to use them.

    Pyannote Models Setup

    For diarization, you need access to pyannote.audio models:

    1. Accept user conditions for the pyannote/segmentation model
    2. Accept user conditions for the pyannote/segmentation-3.0 model
    3. Accept user conditions for the pyannote/embedding model
    4. Log in with Hugging Face:

      huggingface-cli login
    

    💻 Usage Examples

    Command-line Interface

    Start the transcription server with various options:

    # SimulStreaming backend for ultra-low latency
    whisperlivekit-server --backend simulstreaming --model large-v3
    
    # Advanced configuration with diarization
    whisperlivekit-server --host 0.0.0.0 --port 8000 --model medium --diarization --language fr
    

    Python API Integration (Backend)

    Check basic_server for a more complete example of how to use the functions and classes.

    from contextlib import asynccontextmanager
    import asyncio
    
    from fastapi import FastAPI, WebSocket, WebSocketDisconnect
    from whisperlivekit import TranscriptionEngine, AudioProcessor
    
    transcription_engine = None
    
    @asynccontextmanager
    async def lifespan(app: FastAPI):
        # Load the model once at startup; every connection shares this engine
        global transcription_engine
        transcription_engine = TranscriptionEngine(model="medium", diarization=True, lan="en")
        yield
    
    app = FastAPI(lifespan=lifespan)
    
    async def handle_websocket_results(websocket: WebSocket, results_generator):
        # Forward each transcription update to the client, then signal completion
        async for response in results_generator:
            await websocket.send_json(response)
        await websocket.send_json({"type": "ready_to_stop"})
    
    @app.websocket("/asr")
    async def websocket_endpoint(websocket: WebSocket):
        # Accept the connection before any task tries to send on it
        await websocket.accept()
    
        # Create a new AudioProcessor for each connection, passing the shared engine
        audio_processor = AudioProcessor(transcription_engine=transcription_engine)
        results_generator = await audio_processor.create_tasks()
        results_task = asyncio.create_task(handle_websocket_results(websocket, results_generator))
        try:
            while True:
                message = await websocket.receive_bytes()
                await audio_processor.process_audio(message)
        except WebSocketDisconnect:
            pass  # the client closed the connection
        finally:
            # Stop forwarding results once the connection is gone
            results_task.cancel()
    

    Frontend Implementation

    The package ships with an HTML/JavaScript implementation of the frontend. You can also load the page from Python:

      from whisperlivekit import get_web_interface_html
      page = get_web_interface_html()

    ⚙️ Parameters & Configuration

    Parameter        | Description                                           | Default
    ---------------- | ----------------------------------------------------- | -------
    --model          | Whisper model size.                                   | small
    --language       | Source language code or auto                          | en
    --task           | transcribe or translate                               | transcribe
    --backend        | Processing backend                                    | simulstreaming
    --min-chunk-size | Minimum audio chunk size (seconds)                    | 1.0
    --no-vac         | Disable Voice Activity Controller                     | False
    --no-vad         | Disable Voice Activity Detection                      | False
    --warmup-file    | Audio file path for model warmup                      | jfk.wav
    --host           | Server host address                                   | localhost
    --port           | Server port                                           | 8000
    --ssl-certfile   | Path to the SSL certificate file (for HTTPS support)  | None
    --ssl-keyfile    | Path to the SSL private key file (for HTTPS support)  | None

    WhisperStreaming backend options | Description                                    | Default
    -------------------------------- | ---------------------------------------------- | -------
    --confidence-validation          | Use confidence scores for faster validation    | False
    --buffer_trimming                | Buffer trimming strategy (sentence or segment) | segment

    SimulStreaming backend options | Description                                                        | Default
    ------------------------------ | ------------------------------------------------------------------ | -------
    --frame-threshold              | AlignAtt frame threshold (lower = faster, higher = more accurate)  | 25
    --beams                        | Number of beams for beam search (1 = greedy decoding)              | 1
    --decoder                      | Force decoder type (beam or greedy)                                | auto
    --audio-max-len                | Maximum audio buffer length (seconds)                              | 30.0
    --audio-min-len                | Minimum audio length to process (seconds)                          | 0.0
    --cif-ckpt-path                | Path to CIF model for word boundary detection                      | None
    --never-fire                   | Never truncate incomplete words                                    | False
    --init-prompt                  | Initial prompt for the model                                       | None
    --static-init-prompt           | Static prompt that doesn't scroll                                  | None
    --max-context-tokens           | Maximum context tokens                                             | None
    --model-path                   | Direct path to .pt model file. Download it if not found.           | ./base.pt
    --preloaded-model-count        | Optional. Number of models to preload in memory to speed up loading (set up to the expected number of concurrent users) | 1

    Diarization options  | Description                                                              | Default
    -------------------- | ------------------------------------------------------------------------ | -------
    --diarization        | Enable speaker identification                                            | False
    --diarization-backend | diart or sortformer                                                     | sortformer
    --punctuation-split  | Use punctuation to improve speaker boundaries                            | True
    --segmentation-model | Hugging Face model ID for Diart segmentation model. Available models     | pyannote/segmentation-3.0
    --embedding-model    | Hugging Face model ID for Diart embedding model. Available models        | speechbrain/spkrec-ecapa-voxceleb
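
When the server is launched from a script, the flags listed above can be assembled from a settings dictionary. The sketch below is a hypothetical helper, not part of WhisperLiveKit. It assumes option names are given exactly as they appear in the tables (without the leading dashes), that True means "pass the switch", and that False or None means "leave it out".

```python
from typing import Dict, List, Union

OptionValue = Union[str, int, float, bool, None]


def build_server_command(options: Dict[str, OptionValue]) -> List[str]:
    """Turn a settings dict into an argv list for whisperlivekit-server."""
    command = ["whisperlivekit-server"]
    for name, value in options.items():
        flag = "--" + name
        if value is None or value is False:
            continue  # unset option or disabled switch: omit the flag
        if value is True:
            command.append(flag)  # boolean switch such as --diarization
        else:
            command.extend([flag, str(value)])  # valued option such as --port 8000
    return command
```

For example, build_server_command({"model": "medium", "diarization": True, "language": "fr"}) returns an argument list that can be passed to subprocess.run. Passing a list rather than a single shell string avoids quoting problems with paths that contain spaces.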

    🚀 Deployment Guide

    To deploy WhisperLiveKit in production:

    1. Server Setup: Install production ASGI server & launch with multiple workers

      pip install uvicorn gunicorn
      gunicorn -k uvicorn.workers.UvicornWorker -w 4 your_app:app
      
    2. Frontend: Host your customized version of the html example & ensure WebSocket connection points correctly

    3. Nginx Configuration (recommended for production):

      server {
          listen 80;
          server_name your-domain.com;
          location / {
              proxy_pass http://localhost:8000;
              proxy_http_version 1.1;
              proxy_set_header Upgrade $http_upgrade;
              proxy_set_header Connection "upgrade";
              proxy_set_header Host $host;
          }
      }
      
    4. HTTPS Support: For secure deployments, use "wss://" instead of "ws://" in WebSocket URL
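
The ws/wss rule in step 4 can be derived from the page URL rather than hard-coded. The sketch below shows the mapping in Python for consistency with the rest of this document, although a browser frontend would do the same in JavaScript. The function name is hypothetical, and the default /asr path is taken from the endpoint in the Python API example.

```python
from urllib.parse import urlsplit, urlunsplit


def websocket_url(page_url: str, path: str = "/asr") -> str:
    """Map an http(s) page URL to the matching ws(s) URL for the given path."""
    parts = urlsplit(page_url)
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    # Pages served over HTTPS must use the secure WebSocket scheme.
    scheme = "wss" if parts.scheme == "https" else "ws"
    return urlunsplit((scheme, parts.netloc, path, "", ""))
```

Deriving the scheme this way keeps the same frontend working both on a local http://localhost:8000 setup and behind an HTTPS reverse proxy.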

    🐋 Docker

    Deploy the application easily using Docker with GPU or CPU support.

    Prerequisites

    • Docker installed on your system
    • For GPU support: NVIDIA Docker runtime installed

    Quick Start

    With GPU acceleration (recommended):

    docker build -t wlk .
    docker run --gpus all -p 8000:8000 --name wlk wlk
    

    CPU only:

    docker build -f Dockerfile.cpu -t wlk .
    docker run -p 8000:8000 --name wlk wlk
    

    Advanced Usage

    Custom configuration:

    # Example with custom model and language
    docker run --gpus all -p 8000:8000 --name wlk wlk --model large-v3 --language fr
    

    Memory Requirements

    • Large models: Ensure your Docker runtime has sufficient memory allocated

    Customization

    • --build-arg Options:
      • EXTRAS="whisper-timestamped" - Add extras to the image's installation (no spaces). Remember to set necessary container options!
      • HF_PRECACHE_DIR="./.cache/" - Pre-load a model cache for faster first-time start
      • HF_TKN_FILE="./token" - Add your Hugging Face Hub access token to download gated models

    🔮 Use Cases

    • Meeting transcription: capture discussions in real time
    • Accessibility tools: help hearing-impaired users follow conversations
    • Content creation: transcribe podcasts or videos automatically
    • Customer service: transcribe support calls with speaker identification
