VoiceBox: The Open-Source Voice Synthesis Studio That Keeps Your Data Local

Jeff Liu

Key Takeaways

  1. VoiceBox runs entirely on local hardware with no cloud dependency or per-character billing.
  2. Ships with 7 TTS engines including Qwen3-TTS, LuxTTS, Chatterbox, HumeAI TADA, and Kokoro.
  3. Supports voice cloning from short audio samples across 23 languages.
  4. Exposes a full REST API at localhost:17493 for programmatic speech generation and transcription.
  5. Includes a built-in MCP server for integration with Claude Code, Cursor, and other AI agents.

Most voice synthesis tools work the same way. You type text, upload a voice sample, and an API somewhere in the cloud generates audio. It works. But your voice data leaves your machine, your costs scale with every character, and you have limited control over what happens to the models trained on your input.

VoiceBox takes a different approach. It runs entirely on your hardware. No API calls, no per-character billing, no data leaving your machine.

Built by Jamie Pine, VoiceBox is a local-first AI voice studio that handles both sides of the voice loop: text-to-speech output and speech-to-text input. It has accumulated over 24,500 stars on GitHub and is currently on version 0.5.0, released April 2026. The project is open source under the MIT license.

What VoiceBox Does

At its core, VoiceBox is a desktop application built with Tauri (Rust) and FastAPI (Python). It ships with seven TTS engines, each with different strengths:
    • Qwen3-TTS and Qwen CustomVoice for natural-language delivery control
    • LuxTTS for lightweight generation (~1GB VRAM, 150x real-time on CPU)
    • Chatterbox Multilingual and Chatterbox Turbo for expressive speech with paralinguistic tags like [laugh], [sigh], and [gasp]
    • HumeAI TADA for emotional speech synthesis
    • Kokoro for 50+ curated preset voices
These engines support 23 languages, from English and Spanish to Arabic, Japanese, Hindi, and Swahili. Voice cloning works from a short audio sample, and you can switch engines per generation to find what sounds best for your content.

Beyond basic TTS, VoiceBox includes post-processing effects powered by Spotify's pedalboard library: pitch shift, reverb, delay, chorus, compression, and filters. There are four built-in presets (Robotic, Radio, Echo Chamber, Deep Voice) and you can create custom ones.

The Privacy Argument

With cloud-based voice cloning services, your voice data is uploaded to external servers. Depending on the provider's terms of service, that data may be used to train or improve their models. Once uploaded, you lose visibility into how it is processed or stored.

VoiceBox sidesteps this entirely. Models download once and run locally. Voice profiles, generated audio, and capture recordings stay in your data directory. Nothing phones home.

For creators building content at scale, or anyone cloning a voice they care about protecting, this distinction matters.

How to Install VoiceBox

macOS (Apple Silicon)

    • Download the DMG from voicebox.sh/download/mac-arm
    • Drag VoiceBox to your Applications folder
    • On first launch, grant the required Accessibility and Input Monitoring permissions when prompted

macOS (Intel)

Windows

Docker

For headless or server deployments:

docker compose up

Building From Source

If you want to run the latest development version:

git clone https://github.com/jamiepine/voicebox.git
cd voicebox
just setup    # creates Python venv, installs all deps
just dev      # starts backend + desktop app

Prerequisites: Bun, Rust, Python 3.11+, Tauri prerequisites, and Xcode on macOS. Install just via brew install just or cargo install just.

How to Clone Your First Voice

    • Open VoiceBox and navigate to the Profiles section
    • Click Create Profile
    • Either upload an audio file of the voice you want to clone, or record directly in the app
    • Give the profile a name and optional description
    • Select a TTS engine (Qwen3-TTS is a good starting point for general use; LuxTTS if you want fast, lightweight generation)
    • Type your text in the generation box and click Generate
    • Preview the output, apply effects if needed, and export
VoiceBox supports multi-sample profiles for higher-quality cloning. The more reference audio you provide, the more closely the generated voice matches the original.

Working With the REST API

VoiceBox exposes a REST API at http://127.0.0.1:17493 for integrating voice generation into your own applications, scripts, and pipelines.

Generate Speech

curl -X POST http://127.0.0.1:17493/generate \
  -H "Content-Type: application/json" \
  -d '{ "text": "Hello world", "profile_id": "abc123", "language": "en" }'

Agent Voice Output

Any application or script can trigger voice output through a cloned profile:

curl -X POST http://127.0.0.1:17493/speak \
  -H "Content-Type: application/json" \
  -H "X-Voicebox-Client-Id: my-script" \
  -d '{ "text": "Deploy complete.", "profile": "Morgan" }'

Transcribe Audio

curl -X POST http://127.0.0.1:17493/transcribe \
  -F "audio=@recording.wav" \
  -F "model=whisper-turbo"
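Unlike the JSON endpoints, /transcribe takes multipart form data. If you want to call it from Python without third-party dependencies, you can assemble the multipart body by hand; this sketch assumes only the audio and model field names shown in the curl example above:

```python
import io
import mimetypes
import urllib.request
import uuid

def build_multipart(fields: dict, files: dict) -> tuple[bytes, str]:
    """Assemble a multipart/form-data body.

    fields: name -> string value
    files:  name -> (filename, raw bytes)
    Returns (body, Content-Type header value).
    """
    boundary = uuid.uuid4().hex
    buf = io.BytesIO()
    for name, value in fields.items():
        buf.write(
            f'--{boundary}\r\nContent-Disposition: form-data; '
            f'name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    for name, (filename, data) in files.items():
        ctype = mimetypes.guess_type(filename)[0] or "application/octet-stream"
        buf.write(
            f'--{boundary}\r\nContent-Disposition: form-data; '
            f'name="{name}"; filename="{filename}"\r\n'
            f'Content-Type: {ctype}\r\n\r\n'.encode()
        )
        buf.write(data)
        buf.write(b"\r\n")
    buf.write(f"--{boundary}--\r\n".encode())
    return buf.getvalue(), f"multipart/form-data; boundary={boundary}"

def transcribe(path: str, model: str = "whisper-turbo") -> bytes:
    """POST a local audio file to the VoiceBox /transcribe endpoint."""
    with open(path, "rb") as f:
        body, ctype = build_multipart({"model": model}, {"audio": (path, f.read())})
    req = urllib.request.Request(
        "http://127.0.0.1:17493/transcribe",
        data=body,
        headers={"Content-Type": ctype},
    )
    with urllib.request.urlopen(req) as resp:  # requires the app to be running
        return resp.read()
```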

List Voice Profiles

curl http://127.0.0.1:17493/profiles

Full API documentation is available at http://127.0.0.1:17493/docs when the app is running.
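Since all of these endpoints live on one local port, a thin wrapper is enough for most scripts. A minimal sketch using only the Python standard library, with endpoint paths, payload fields, and the client-id header taken from the examples above:

```python
import json
import urllib.request

BASE = "http://127.0.0.1:17493"  # VoiceBox's local API

def build_request(path, payload=None, client_id=None):
    """Build a urllib Request for a VoiceBox endpoint (POST if a payload is given)."""
    headers = {}
    if payload is not None:
        headers["Content-Type"] = "application/json"
    if client_id is not None:
        headers["X-Voicebox-Client-Id"] = client_id
    data = json.dumps(payload).encode() if payload is not None else None
    return urllib.request.Request(BASE + path, data=data, headers=headers)

def speak(text, profile, client_id="my-script"):
    """Trigger voice output through a cloned profile; requires the app to be running."""
    req = build_request("/speak", {"text": text, "profile": profile}, client_id)
    with urllib.request.urlopen(req) as resp:
        return resp.status

def list_profiles():
    """Fetch all voice profiles as parsed JSON."""
    with urllib.request.urlopen(build_request("/profiles")) as resp:
        return json.load(resp)
```

With this in place, `speak("Deploy complete.", "Morgan")` is the Python equivalent of the curl call in the Agent Voice Output section.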

MCP Server for AI Agents

VoiceBox ships with a built-in Model Context Protocol (MCP) server. Any MCP-aware agent like Claude Code, Cursor, or Windsurf can speak, transcribe, and browse voice profiles directly.

Claude Code Setup (one line)

claude mcp add voicebox \
  --transport http \
  --url http://127.0.0.1:17493/mcp \
  --header "X-Voicebox-Client-Id: claude-code"

Cursor / Windsurf / VS Code

Add to your MCP config:

{
  "mcpServers": {
    "voicebox": {
      "url": "http://127.0.0.1:17493/mcp",
      "headers": {
        "X-Voicebox-Client-Id": "cursor"
      }
    }
  }
}

Four MCP tools are available: voicebox.speak, voicebox.transcribe, voicebox.list_captures, and voicebox.list_profiles. You can bind specific voice profiles to specific agents in Settings, so Claude Code uses one voice and Cursor uses another.
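These tools are exposed over MCP's standard JSON-RPC interface, so you can also exercise them without an agent in the loop. The sketch below builds a standard tools/call request; the argument names mirror the REST /speak payload and are an assumption, not documented MCP schema:

```python
import json

def mcp_tool_call(tool: str, arguments: dict, request_id: int = 1) -> str:
    """Build a JSON-RPC 2.0 tools/call request, the shape MCP servers expect."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })

# Assumed argument names, mirroring the REST /speak payload.
payload = mcp_tool_call("voicebox.speak", {"text": "Build finished.", "profile": "Morgan"})
```

POST a body like this to http://127.0.0.1:17493/mcp (with the X-Voicebox-Client-Id header, as in the setup snippets above) to drive the same tools an agent would use.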

The Stories Editor

For longer-form audio like podcasts, conversations, or narrative content, VoiceBox includes a multi-track timeline editor. You can compose multi-voice projects with drag-and-drop, trim and split audio inline, pin specific generation versions per track clip, and export the composed timeline.

This is useful for anyone producing audio content that involves more than one voice or requires precise timing control.

How VoiceBox Compares to Cloud Alternatives

| Feature | VoiceBox | ElevenLabs | Wispr Flow |
| --- | --- | --- | --- |
| Voice cloning | Local, on-device | Cloud API | N/A |
| Speech-to-text | Local Whisper | Cloud API | Cloud API |
| Cost model | Free (your hardware) | Per-character billing | Subscription |
| Data privacy | All data stays local | Data uploaded to cloud | Data uploaded to cloud |
| TTS engines | 7 engines, switchable | Proprietary models | N/A |
| MCP integration | Built-in | No | No |
| Open source | MIT license | Proprietary | Proprietary |
| Platform support | macOS, Windows, Linux, Docker | Web, API | macOS |

VoiceBox is not a direct replacement for every use case. Cloud services offer higher-fidelity voices out of the box and require zero hardware configuration. But for workflows where data ownership, cost control, and API flexibility matter, VoiceBox fills a gap that cloud providers do not address.

Who This Is For

    • Content creators producing narrated articles, podcasts, or video voiceovers at scale without per-character costs
    • Developers integrating voice I/O into applications via REST API or MCP
    • AI agent builders who want their agents to speak in cloned voices
    • Privacy-conscious teams that cannot send voice data to third-party servers
    • Accessibility projects building voice synthesis tools for people who can't speak in their original voice
The project is actively maintained with 588 commits, 25 releases, and an open roadmap that includes additional STT engines and platform-specific improvements.

Source: github.com/jamiepine/voicebox
Website: voicebox.sh
Docs: docs.voicebox.sh

What This Means For You

1. Replace cloud TTS costs with local inference

VoiceBox eliminates per-character billing by running voice synthesis on local hardware. For teams generating audio at scale, this can reduce voice generation costs to zero after the initial hardware investment.

2. Add voice I/O to AI agent workflows

The built-in MCP server and REST API let you add voice input and output to any MCP-aware agent or custom application. Agents can speak through cloned voice profiles and transcribe audio without external API dependencies.

3. Keep voice data under your control

Unlike cloud voice services that upload voice samples to external servers, VoiceBox processes everything locally. This is relevant for teams handling sensitive voice data or operating under data residency requirements.

FAQ

What is VoiceBox?

VoiceBox is an open-source, local-first AI voice studio built with Tauri and FastAPI. It handles text-to-speech, speech-to-text, and voice cloning entirely on your hardware. The project has over 24,500 GitHub stars and is licensed under MIT.

Which TTS engines does VoiceBox include?

VoiceBox ships with 7 TTS engines: Qwen3-TTS, Qwen CustomVoice, LuxTTS, Chatterbox Multilingual, Chatterbox Turbo, HumeAI TADA, and Kokoro. Each engine has different strengths, from lightweight CPU generation to expressive emotional speech.

How does VoiceBox keep voice data private?

All models download once and run locally. Voice profiles, generated audio, and recordings stay in your local data directory. No data is sent to external servers. This contrasts with cloud services, where voice data is uploaded and may be used to train provider models.

Does VoiceBox have an API?

Yes. VoiceBox exposes a REST API at http://127.0.0.1:17493 with endpoints for speech generation, voice output, transcription, and profile management. Full API documentation is available at /docs when the app is running.

Can AI agents use VoiceBox?

Yes. VoiceBox includes a built-in MCP (Model Context Protocol) server. AI agents like Claude Code, Cursor, and Windsurf can speak, transcribe, and list voice profiles directly. You can bind different voice profiles to different agents.

What platforms does VoiceBox run on?

VoiceBox runs on macOS (Apple Silicon and Intel), Windows, Linux, and Docker. macOS uses MLX and Metal for GPU acceleration, Windows uses CUDA, and Docker enables headless server deployments.

How do I clone a voice in VoiceBox?

Open VoiceBox, go to Profiles, click Create Profile, upload or record a voice sample, select a TTS engine like Qwen3-TTS, type your text, and click Generate. Multi-sample profiles produce higher-quality clones.
