VoiceBox: The Open-Source Voice Synthesis Studio That Keeps Your Data Local

Jeff Liu

Key Takeaways

  1. VoiceBox runs entirely on local hardware with no cloud dependency or per-character billing.
  2. Ships with 7 TTS engines including Qwen3-TTS, LuxTTS, Chatterbox, HumeAI TADA, and Kokoro.
  3. Supports voice cloning from short audio samples across 23 languages.
  4. Exposes a full REST API at localhost:17493 for programmatic speech generation and transcription.
  5. Includes a built-in MCP server for integration with Claude Code, Cursor, and other AI agents.

Most voice synthesis tools work the same way. You type text, upload a voice sample, and an API somewhere in the cloud generates audio. It works. But your voice data leaves your machine, your costs scale with every character, and you have limited control over what happens to the models trained on your input.

VoiceBox takes a different approach. It runs entirely on your hardware. No API calls, no per-character billing, no data leaving your machine.

Built by Jamie Pine, VoiceBox is a local-first AI voice studio that handles both sides of the voice loop: text-to-speech output and speech-to-text input. It has accumulated over 24,500 stars on GitHub and is currently on version 0.5.0, released April 2026. The project is open source under the MIT license.

What VoiceBox Does

At its core, VoiceBox is a desktop application built with Tauri (Rust) and FastAPI (Python). It ships with seven TTS engines, each with different strengths:
    • Qwen3-TTS and Qwen CustomVoice for natural-language delivery control
    • LuxTTS for lightweight generation (~1GB VRAM, 150x real-time on CPU)
    • Chatterbox Multilingual and Chatterbox Turbo for expressive speech with paralinguistic tags like [laugh], [sigh], and [gasp]
    • HumeAI TADA for emotional speech synthesis
    • Kokoro for 50+ curated preset voices
These engines support 23 languages, from English and Spanish to Arabic, Japanese, Hindi, and Swahili. Voice cloning works from a short audio sample, and you can switch engines per generation to find what sounds best for your content.

Beyond basic TTS, VoiceBox includes post-processing effects powered by Spotify's pedalboard library: pitch shift, reverb, delay, chorus, compression, and filters. There are four built-in presets (Robotic, Radio, Echo Chamber, Deep Voice) and you can create custom ones.

The Privacy Argument

With cloud-based voice cloning services, your voice data is uploaded to external servers. Depending on the provider's terms of service, that data may be used to train or improve their models. Once uploaded, you lose visibility into how it is processed or stored.

VoiceBox sidesteps this entirely. Models download once and run locally. Voice profiles, generated audio, and capture recordings stay in your data directory. Nothing phones home.

For creators building content at scale, or anyone cloning a voice they care about protecting, this distinction matters.

How to Install VoiceBox

macOS (Apple Silicon)

    • Download the DMG from voicebox.sh/download/mac-arm
    • Drag VoiceBox to your Applications folder
    • On first launch, grant the required Accessibility and Input Monitoring permissions when prompted

macOS (Intel)

Windows

Docker

For headless or server deployments:

docker compose up

Building From Source

If you want to run the latest development version:

git clone https://github.com/jamiepine/voicebox.git
cd voicebox
just setup    # creates Python venv, installs all deps
just dev      # starts backend + desktop app

Prerequisites: Bun, Rust, Python 3.11+, Tauri prerequisites, and Xcode on macOS. Install just via brew install just or cargo install just.

How to Clone Your First Voice

    • Open VoiceBox and navigate to the Profiles section
    • Click Create Profile
    • Either upload an audio file of the voice you want to clone, or record directly in the app
    • Give the profile a name and optional description
    • Select a TTS engine (Qwen3-TTS is a good starting point for general use; LuxTTS if you want fast, lightweight generation)
    • Type your text in the generation box and click Generate
    • Preview the output, apply effects if needed, and export
VoiceBox supports multi-sample profiles for higher-quality cloning. The more reference audio you provide, the more closely the generated voice matches the original.

Working With the REST API

VoiceBox exposes a REST API at http://127.0.0.1:17493 for integrating voice generation into your own applications, scripts, and pipelines.

Generate Speech

curl -X POST http://127.0.0.1:17493/generate \
  -H "Content-Type: application/json" \
  -d '{ "text": "Hello world", "profile_id": "abc123", "language": "en" }'

Agent Voice Output

Any application or script can trigger voice output through a cloned profile:

curl -X POST http://127.0.0.1:17493/speak \
  -H "Content-Type: application/json" \
  -H "X-Voicebox-Client-Id: my-script" \
  -d '{ "text": "Deploy complete.", "profile": "Morgan" }'

Transcribe Audio

curl -X POST http://127.0.0.1:17493/transcribe \
  -F "audio=@recording.wav" \
  -F "model=whisper-turbo"
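Unlike the JSON endpoints, /transcribe takes multipart form data. If you want to call it from Python without third-party dependencies, you can assemble the multipart body by hand; this sketch assumes only the audio and model field names shown in the curl example above:

```python
import io
import mimetypes
import urllib.request
import uuid

def build_multipart(fields: dict, files: dict) -> tuple[bytes, str]:
    """Assemble a multipart/form-data body.

    fields: name -> string value
    files:  name -> (filename, raw bytes)
    Returns (body, Content-Type header value).
    """
    boundary = uuid.uuid4().hex
    buf = io.BytesIO()
    for name, value in fields.items():
        buf.write(
            f'--{boundary}\r\nContent-Disposition: form-data; '
            f'name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    for name, (filename, data) in files.items():
        ctype = mimetypes.guess_type(filename)[0] or "application/octet-stream"
        buf.write(
            f'--{boundary}\r\nContent-Disposition: form-data; '
            f'name="{name}"; filename="{filename}"\r\n'
            f'Content-Type: {ctype}\r\n\r\n'.encode()
        )
        buf.write(data)
        buf.write(b"\r\n")
    buf.write(f"--{boundary}--\r\n".encode())
    return buf.getvalue(), f"multipart/form-data; boundary={boundary}"

def transcribe(path: str, model: str = "whisper-turbo") -> bytes:
    """POST a local audio file to the VoiceBox /transcribe endpoint."""
    with open(path, "rb") as f:
        body, ctype = build_multipart({"model": model}, {"audio": (path, f.read())})
    req = urllib.request.Request(
        "http://127.0.0.1:17493/transcribe",
        data=body,
        headers={"Content-Type": ctype},
    )
    with urllib.request.urlopen(req) as resp:  # requires the app to be running
        return resp.read()
```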

List Voice Profiles

curl http://127.0.0.1:17493/profiles

Full API documentation is available at http://127.0.0.1:17493/docs when the app is running.
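Since all of these endpoints live on one local port, a thin wrapper is enough for most scripts. A minimal sketch using only the Python standard library, with endpoint paths, payload fields, and the client-id header taken from the examples above:

```python
import json
import urllib.request

BASE = "http://127.0.0.1:17493"  # VoiceBox's local API

def build_request(path, payload=None, client_id=None):
    """Build a urllib Request for a VoiceBox endpoint (POST if a payload is given)."""
    headers = {}
    if payload is not None:
        headers["Content-Type"] = "application/json"
    if client_id is not None:
        headers["X-Voicebox-Client-Id"] = client_id
    data = json.dumps(payload).encode() if payload is not None else None
    return urllib.request.Request(BASE + path, data=data, headers=headers)

def speak(text, profile, client_id="my-script"):
    """Trigger voice output through a cloned profile; requires the app to be running."""
    req = build_request("/speak", {"text": text, "profile": profile}, client_id)
    with urllib.request.urlopen(req) as resp:
        return resp.status

def list_profiles():
    """Fetch all voice profiles as parsed JSON."""
    with urllib.request.urlopen(build_request("/profiles")) as resp:
        return json.load(resp)
```

With this in place, `speak("Deploy complete.", "Morgan")` is the Python equivalent of the curl call in the Agent Voice Output section.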

MCP Server for AI Agents

VoiceBox ships with a built-in Model Context Protocol (MCP) server. Any MCP-aware agent like Claude Code, Cursor, or Windsurf can speak, transcribe, and browse voice profiles directly.

Claude Code Setup (one line)

claude mcp add voicebox \
  --transport http \
  --url http://127.0.0.1:17493/mcp \
  --header "X-Voicebox-Client-Id: claude-code"

Cursor / Windsurf / VS Code

Add to your MCP config:

{
  "mcpServers": {
    "voicebox": {
      "url": "http://127.0.0.1:17493/mcp",
      "headers": {
        "X-Voicebox-Client-Id": "cursor"
      }
    }
  }
}

Four MCP tools are available: voicebox.speak, voicebox.transcribe, voicebox.list_captures, and voicebox.list_profiles. You can bind specific voice profiles to specific agents in Settings, so Claude Code uses one voice and Cursor uses another.
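These tools are exposed over MCP's standard JSON-RPC interface, so you can also exercise them without an agent in the loop. The sketch below builds a standard tools/call request; the argument names mirror the REST /speak payload and are an assumption, not documented MCP schema:

```python
import json

def mcp_tool_call(tool: str, arguments: dict, request_id: int = 1) -> str:
    """Build a JSON-RPC 2.0 tools/call request, the shape MCP servers expect."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })

# Assumed argument names, mirroring the REST /speak payload.
payload = mcp_tool_call("voicebox.speak", {"text": "Build finished.", "profile": "Morgan"})
```

POST a body like this to http://127.0.0.1:17493/mcp (with the X-Voicebox-Client-Id header, as in the setup snippets above) to drive the same tools an agent would use.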

The Stories Editor

For longer-form audio like podcasts, conversations, or narrative content, VoiceBox includes a multi-track timeline editor. You can compose multi-voice projects with drag-and-drop, trim and split audio inline, pin specific generation versions per track clip, and export the composed timeline.

This is useful for anyone producing audio content that involves more than one voice or requires precise timing control.

How VoiceBox Compares to Cloud Alternatives

| Feature | VoiceBox | ElevenLabs | Wispr Flow |
| --- | --- | --- | --- |
| Voice cloning | Local, on-device | Cloud API | N/A |
| Speech-to-text | Local Whisper | Cloud API | Cloud API |
| Cost model | Free (your hardware) | Per-character billing | Subscription |
| Data privacy | All data stays local | Data uploaded to cloud | Data uploaded to cloud |
| TTS engines | 7 engines, switchable | Proprietary models | N/A |
| MCP integration | Built-in | No | No |
| Open source | MIT license | Proprietary | Proprietary |
| Platform support | macOS, Windows, Linux, Docker | Web, API | macOS |

VoiceBox is not a direct replacement for every use case. Cloud services offer higher-fidelity voices out of the box and require zero hardware configuration. But for workflows where data ownership, cost control, and API flexibility matter, VoiceBox fills a gap that cloud providers do not address.

Who This Is For

    • Content creators producing narrated articles, podcasts, or video voiceovers at scale without per-character costs
    • Developers integrating voice I/O into applications via REST API or MCP
    • AI agent builders who want their agents to speak in cloned voices
    • Privacy-conscious teams that cannot send voice data to third-party servers
    • Accessibility projects building voice synthesis tools for people who can't speak in their original voice
The project is actively maintained with 588 commits, 25 releases, and an open roadmap that includes additional STT engines and platform-specific improvements.

Source: github.com/jamiepine/voicebox
Website: voicebox.sh
Docs: docs.voicebox.sh

What This Means For You

1. Replace cloud TTS costs with local inference

VoiceBox eliminates per-character billing by running voice synthesis on local hardware. For teams generating audio at scale, this can reduce voice generation costs to zero after the initial hardware investment.

2. Add voice I/O to AI agent workflows

The built-in MCP server and REST API let you add voice input and output to any MCP-aware agent or custom application. Agents can speak through cloned voice profiles and transcribe audio without external API dependencies.

3. Keep voice data under your control

Unlike cloud voice services that upload voice samples to external servers, VoiceBox processes everything locally. This is relevant for teams handling sensitive voice data or operating under data residency requirements.

FAQ

What is VoiceBox?

VoiceBox is an open-source, local-first AI voice studio built with Tauri and FastAPI. It handles text-to-speech, speech-to-text, and voice cloning entirely on your hardware. The project has over 24,500 GitHub stars and is licensed under MIT.

Which TTS engines does VoiceBox include?

VoiceBox ships with 7 TTS engines: Qwen3-TTS, Qwen CustomVoice, LuxTTS, Chatterbox Multilingual, Chatterbox Turbo, HumeAI TADA, and Kokoro. Each engine has different strengths, from lightweight CPU generation to expressive emotional speech.

How does VoiceBox keep voice data private?

All models download once and run locally. Voice profiles, generated audio, and recordings stay in your local data directory. No data is sent to external servers. This contrasts with cloud services, where voice data is uploaded and may be used to train provider models.

Does VoiceBox have an API?

Yes. VoiceBox exposes a REST API at http://127.0.0.1:17493 with endpoints for speech generation, voice output, transcription, and profile management. Full API documentation is available at /docs when the app is running.

Can AI agents use VoiceBox?

Yes. VoiceBox includes a built-in MCP (Model Context Protocol) server. AI agents like Claude Code, Cursor, and Windsurf can speak, transcribe, and list voice profiles directly. You can bind different voice profiles to different agents.

What platforms does VoiceBox run on?

VoiceBox runs on macOS (Apple Silicon and Intel), Windows, Linux, and Docker. macOS uses MLX and Metal for GPU acceleration, Windows uses CUDA, and Docker enables headless server deployments.

How do I clone a voice in VoiceBox?

Open VoiceBox, go to Profiles, click Create Profile, upload or record a voice sample, select a TTS engine like Qwen3-TTS, type your text, and click Generate. Multi-sample profiles produce higher-quality clones.
