
Voicebox

    The open-source voice synthesis studio.
    Clone voices. Generate speech. Apply effects. Build voice-powered apps.
    All running locally on your machine.


voicebox.sh · Docs · Download · Features · API


App screenshots and a demo video are available on voicebox.sh.


    What is Voicebox?

    Voicebox is a local-first voice cloning studio — a free and open-source alternative to ElevenLabs. Clone voices from a few seconds of audio, generate speech in 23 languages across 5 TTS engines, apply post-processing effects, and compose multi-voice projects with a timeline editor.

    • Complete privacy — models and voice data stay on your machine
    • 5 TTS engines — Qwen3-TTS, LuxTTS, Chatterbox Multilingual, Chatterbox Turbo, and HumeAI TADA
    • 23 languages — from English to Arabic, Japanese, Hindi, Swahili, and more
    • Post-processing effects — pitch shift, reverb, delay, chorus, compression, and filters
    • Expressive speech — paralinguistic tags like [laugh], [sigh], [gasp] via Chatterbox Turbo
    • Unlimited length — auto-chunking with crossfade for scripts, articles, and chapters
    • Stories editor — multi-track timeline for conversations, podcasts, and narratives
    • API-first — REST API for integrating voice synthesis into your own projects
    • Native performance — built with Tauri (Rust), not Electron
    • Runs everywhere — macOS (MLX/Metal), Windows (CUDA), Linux, AMD ROCm, Intel Arc, Docker

    Download

| Platform | Download |
| --- | --- |
| macOS (Apple Silicon) | Download DMG |
| macOS (Intel) | Download DMG |
| Windows | Download MSI |
| Docker | docker compose up |

    View all binaries →

    Linux — Pre-built binaries are not yet available. See voicebox.sh/linux-install for build-from-source instructions.


    Features

    Multi-Engine Voice Cloning

    Five TTS engines with different strengths, switchable per-generation:

| Engine | Languages | Strengths |
| --- | --- | --- |
| Qwen3-TTS (0.6B / 1.7B) | 10 | High-quality multilingual cloning, delivery instructions ("speak slowly", "whisper") |
| LuxTTS | English | Lightweight (~1GB VRAM), 48kHz output, 150x realtime on CPU |
| Chatterbox Multilingual | 23 | Broadest language coverage: Arabic, Danish, Finnish, Greek, Hebrew, Hindi, Malay, Norwegian, Polish, Swahili, Swedish, Turkish, and more |
| Chatterbox Turbo | English | Fast 350M model with paralinguistic emotion/sound tags |
| TADA (1B / 3B) | 10 | HumeAI speech-language model: 700s+ coherent audio, text-acoustic dual alignment |

    Emotions & Paralinguistic Tags

    Type / in the text input to insert expressive tags that the model synthesizes inline with speech (Chatterbox Turbo):

    [laugh] [chuckle] [gasp] [cough] [sigh] [groan] [sniff] [shush] [clear throat]

    Post-Processing Effects

    8 audio effects powered by Spotify's pedalboard library. Apply after generation, preview in real time, build reusable presets.

| Effect | Description |
| --- | --- |
| Pitch Shift | Up or down by up to 12 semitones |
| Reverb | Configurable room size, damping, wet/dry mix |
| Delay | Echo with adjustable time, feedback, and mix |
| Chorus / Flanger | Modulated delay for metallic or lush textures |
| Compressor | Dynamic range compression |
| Gain | Volume adjustment (-40 to +40 dB) |
| High-Pass Filter | Remove low frequencies |
| Low-Pass Filter | Remove high frequencies |

    Ships with 4 built-in presets (Robotic, Radio, Echo Chamber, Deep Voice) and supports custom presets. Effects can be assigned per-profile as defaults.

    Unlimited Generation Length

    Text is automatically split at sentence boundaries and each chunk is generated independently, then crossfaded together. Works with all engines.

    • Configurable auto-chunking limit (100–5,000 chars)
    • Crossfade slider (0–200ms) for smooth transitions
    • Max text length: 50,000 characters
    • Smart splitting respects abbreviations, CJK punctuation, and [tags]
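The chunk-and-crossfade idea can be sketched in a few lines of Python. This is a toy version under simplifying assumptions: it splits naively on sentence-ending punctuation and ignores the abbreviation, CJK, and [tag] handling that the real splitter performs:

```python
import re

import numpy as np


def split_text(text: str, limit: int = 500) -> list[str]:
    """Split text at sentence boundaries, packing sentences up to `limit` chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > limit:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks


def crossfade(a: np.ndarray, b: np.ndarray, fade_samples: int) -> np.ndarray:
    """Join two audio chunks with a linear crossfade over `fade_samples` samples."""
    fade = np.linspace(0.0, 1.0, fade_samples)
    overlap = a[-fade_samples:] * (1.0 - fade) + b[:fade_samples] * fade
    return np.concatenate([a[:-fade_samples], overlap, b[fade_samples:]])
```

Each chunk is synthesized independently, then adjacent outputs are joined with `crossfade`, which is why the transition length is exposed as a slider.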

    Generation Versions

    Every generation supports multiple versions with provenance tracking:

    • Original — clean TTS output, always preserved
    • Effects versions — apply different effects chains from any source version
    • Takes — regenerate with a new seed for variation
    • Source tracking — each version records its lineage
    • Favorites — star generations for quick access

    Async Generation Queue

    Generation is non-blocking. Submit and immediately start typing the next one.

    • Serial execution queue prevents GPU contention
    • Real-time SSE status streaming
    • Failed generations can be retried
    • Stale generations from crashes auto-recover on startup
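The serial-execution idea can be sketched with a single worker thread draining a standard FIFO queue, so only one job touches the GPU at a time. The job names and result format below are illustrative; the real backend layers SSE streaming, retries, and crash recovery on top of this pattern:

```python
import queue
import threading

jobs: "queue.Queue" = queue.Queue()
results: list[str] = []


def worker() -> None:
    # One worker drains jobs in submission order: serial GPU access.
    while True:
        job = jobs.get()
        if job is None:                    # sentinel: shut the worker down
            break
        results.append(f"done:{job}")      # stand-in for running TTS inference


t = threading.Thread(target=worker, daemon=True)
t.start()

for job_id in ("a", "b", "c"):             # submissions return immediately
    jobs.put(job_id)

jobs.put(None)
t.join()
```

Callers never block on inference: `put` returns at once, which is what makes the UI feel non-blocking while the queue guarantees ordered, one-at-a-time execution.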

    Voice Profile Management

    • Create profiles from audio files or record directly in-app
    • Import/export profiles to share or back up
    • Multi-sample support for higher quality cloning
    • Per-profile default effects chains
    • Organize with descriptions and language tags

    Stories Editor

    Multi-voice timeline editor for conversations, podcasts, and narratives.

    • Multi-track composition with drag-and-drop
    • Inline audio trimming and splitting
    • Auto-playback with synchronized playhead
    • Version pinning per track clip

    Recording & Transcription

    • In-app recording with waveform visualization
    • System audio capture (macOS and Windows)
    • Automatic transcription powered by Whisper (including Whisper Turbo)
    • Export recordings in multiple formats

    Model Management

    • Per-model unload to free GPU memory without deleting downloads
    • Custom models directory via VOICEBOX_MODELS_DIR
    • Model folder migration with progress tracking
    • Download cancel/clear UI

    GPU Support

| Platform | Backend | Notes |
| --- | --- | --- |
| macOS (Apple Silicon) | MLX (Metal) | 4-5x faster via Neural Engine |
| Windows / Linux (NVIDIA) | PyTorch (CUDA) | Auto-downloads CUDA binary from within the app |
| Linux (AMD) | PyTorch (ROCm) | Auto-configures HSA_OVERRIDE_GFX_VERSION |
| Windows (any GPU) | DirectML | Universal Windows GPU support |
| Intel Arc | IPEX/XPU | Intel discrete GPU acceleration |
| Any | CPU | Works everywhere, just slower |
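The platforms above imply a preference order when more than one backend is usable on a machine. A hypothetical selection helper, with names and ordering inferred from the table rather than taken from the app's actual detection code:

```python
# Illustrative preference order; the app's real detection logic may differ.
PRIORITY = ("mlx", "cuda", "rocm", "directml", "xpu", "cpu")


def pick_backend(available: set[str]) -> str:
    """Return the highest-priority inference backend that is available."""
    for backend in PRIORITY:
        if backend in available:
            return backend
    return "cpu"  # CPU always works, just slower
```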

    API

    Voicebox exposes a full REST API for integrating voice synthesis into your own apps.

    # Generate speech
    curl -X POST http://localhost:17493/generate \
      -H "Content-Type: application/json" \
      -d '{"text": "Hello world", "profile_id": "abc123", "language": "en"}'
    
    # List voice profiles
    curl http://localhost:17493/profiles
    
    # Create a profile
    curl -X POST http://localhost:17493/profiles \
      -H "Content-Type: application/json" \
      -d '{"name": "My Voice", "language": "en"}'
    

    Use cases: game dialogue, podcast production, accessibility tools, voice assistants, content automation.

    Full API documentation available at http://localhost:17493/docs.
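Building on the curl examples above, the same calls can be made from Python with only the standard library. The endpoint paths, port, and field names come from those examples; the function names here are illustrative:

```python
import json
import urllib.request

BASE = "http://localhost:17493"  # default port from the examples above


def build_generate_payload(text: str, profile_id: str, language: str = "en") -> dict:
    """Assemble the JSON body used by POST /generate."""
    return {"text": text, "profile_id": profile_id, "language": language}


def generate(text: str, profile_id: str, language: str = "en") -> bytes:
    """Call the /generate endpoint; requires a running Voicebox instance."""
    req = urllib.request.Request(
        f"{BASE}/generate",
        data=json.dumps(build_generate_payload(text, profile_id, language)).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```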


    Tech Stack

| Layer | Technology |
| --- | --- |
| Desktop App | Tauri (Rust) |
| Frontend | React, TypeScript, Tailwind CSS |
| State | Zustand, React Query |
| Backend | FastAPI (Python) |
| TTS Engines | Qwen3-TTS, LuxTTS, Chatterbox, Chatterbox Turbo, TADA |
| Effects | Pedalboard (Spotify) |
| Transcription | Whisper / Whisper Turbo (PyTorch or MLX) |
| Inference | MLX (Apple Silicon) / PyTorch (CUDA/ROCm/XPU/CPU) |
| Database | SQLite |
| Audio | WaveSurfer.js, librosa |

    Roadmap

| Feature | Description |
| --- | --- |
| Real-time Streaming | Stream audio as it generates, word by word |
| Voice Design | Create new voices from text descriptions |
| More Models | XTTS, Bark, and other open-source voice models |
| Plugin Architecture | Extend with custom models and effects |
| Mobile Companion | Control Voicebox from your phone |

    Development

    See CONTRIBUTING.md for detailed setup and contribution guidelines.

    Quick Start

    git clone https://github.com/jamiepine/voicebox.git
    cd voicebox
    
    just setup   # creates Python venv, installs all deps
    just dev     # starts backend + desktop app
    

    Install just: brew install just or cargo install just. Run just --list to see all commands.

    Prerequisites: Bun, Rust, Python 3.11+, Tauri Prerequisites, and Xcode on macOS.

    Building Locally

    just build          # Build CPU server binary + Tauri app
    just build-local    # (Windows) Build CPU + CUDA server binaries + Tauri app
    

    Adding New Voice Models

    The multi-engine architecture makes adding new TTS engines straightforward. A step-by-step guide covers the full process: dependency research, backend protocol implementation, frontend wiring, and PyInstaller bundling.

    The guide is optimized for AI coding agents. An agent skill can pick up a model name and handle the entire integration autonomously — you just test the build locally.

    Project Structure

    voicebox/
    ├── app/              # Shared React frontend
    ├── tauri/            # Desktop app (Tauri + Rust)
    ├── web/              # Web deployment
    ├── backend/          # Python FastAPI server
    ├── landing/          # Marketing website
    └── scripts/          # Build & release scripts
    

    Contributing

    Contributions welcome! See CONTRIBUTING.md for guidelines.

    1. Fork the repo
    2. Create a feature branch
    3. Make your changes
    4. Submit a PR

    Security

    Found a security vulnerability? Please report it responsibly. See SECURITY.md for details.


    License

    MIT License — see LICENSE for details.

