    roomkit

    Git Repo
    roomkit-live

    RoomKit is an async Python framework for building voice AI agents inside multi-channel conversations, with local or cloud STT/TTS, barge-in, and speech-to-speech AI.

    About roomkit

    RoomKit is a Python framework that brings voice AI into multi-channel conversations, treating voice as one channel alongside SMS, email, and chat in the same room.

    • For the Non-Technical Reader:

      RoomKit lets you build voice assistants that don't live in isolation. Your voice agent joins a "room" alongside SMS, email, and chat, so a customer can start by voice and continue by text with full context. It supports cloud providers (Deepgram, ElevenLabs) and fully local pipelines on GPU, so your assistant can run without any cloud dependency.

    • For the Technical Reader:

      RoomKit provides two voice architectures. The first is an STT/TTS pipeline (Deepgram, ElevenLabs, sherpa-onnx) with streaming VAD (Silero, TEN-VAD), noise suppression, echo cancellation, and barge-in. The second is a speech-to-speech mode via Gemini Live and OpenAI Realtime, with tool calling.

      Both plug into the same room abstraction alongside SMS, email, WebSocket, and AI text channels.

      Hooks intercept the ON_TRANSCRIPTION, BEFORE_TTS, ON_SPEECH_START, and ON_BARGE_IN events. Local pipelines achieve sub-300 ms latency on an RTX 4070 with Kroko ASR + Piper TTS + Ollama.

    • Why It Matters:

      Most voice AI frameworks treat voice as standalone. RoomKit treats it as one channel in a conversation. Your voice agent can hand off to SMS, escalate via email, or pull context from a previous chat without rebuilding routing logic. Local pipeline support addresses privacy and latency concerns in healthcare, finance, and enterprise.

    • The "Voice AI Space Lab" Idea:

      Build a fully local voice assistant (no API keys) using sherpa-onnx + Ollama, then attach an SMS channel to the same room so users switch between speaking and texting mid-conversation. Add MCP tools for document search by voice.
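
To make the "voice as one channel in a room" idea concrete, here is a minimal, self-contained sketch in plain Python. It does not use RoomKit's actual API: the `Room` class, its `on`, `run_hooks`, and `receive` methods, and the two example hooks are hypothetical names invented for illustration. Only the hook names ON_TRANSCRIPTION and BEFORE_TTS are taken from the description above. The sketch assumes that a hook can rewrite text or block it by returning `None`, which may differ from how RoomKit's hooks behave.

```python
import asyncio
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Awaitable, Callable, Optional

# Hypothetical illustration only: none of these names are RoomKit's real API.
Hook = Callable[[str], Awaitable[Optional[str]]]


@dataclass
class Room:
    """One conversation whose history is shared by every attached channel."""

    history: list = field(default_factory=list)  # (channel, text) pairs
    hooks: dict = field(default_factory=lambda: defaultdict(list))

    def on(self, event: str, hook: Hook) -> None:
        """Register a hook for an event name such as "ON_TRANSCRIPTION"."""
        self.hooks[event].append(hook)

    async def run_hooks(self, event: str, text: str) -> Optional[str]:
        """Pass text through each hook in order; a None result blocks it."""
        for hook in self.hooks[event]:
            result = await hook(text)
            if result is None:
                return None
            text = result
        return text

    async def receive(self, channel: str, text: str) -> Optional[str]:
        """Handle an inbound message from any channel and build a reply."""
        if channel == "voice":
            # Assumed behavior: transcripts can be filtered before storage.
            filtered = await self.run_hooks("ON_TRANSCRIPTION", text)
            if filtered is None:
                return None
            text = filtered
        self.history.append((channel, text))
        # Stand-in for the AI reply: echo the shared cross-channel context.
        reply = "context: " + " | ".join(t for _, t in self.history)
        if channel == "voice":
            # Assumed behavior: spoken replies can be rewritten before TTS.
            return await self.run_hooks("BEFORE_TTS", reply)
        return reply


async def drop_empty(text: str) -> Optional[str]:
    """Block transcripts that contain only whitespace."""
    return text if text.strip() else None


async def redact_digits(text: str) -> Optional[str]:
    """Mask digits so numbers are not read aloud."""
    return "".join("#" if c.isdigit() else c for c in text)


async def demo() -> list:
    room = Room()
    room.on("ON_TRANSCRIPTION", drop_empty)
    room.on("BEFORE_TTS", redact_digits)
    return [
        await room.receive("voice", "my order is late"),
        await room.receive("voice", "   "),
        await room.receive("sms", "order 4521"),
        await room.receive("voice", "read it back"),
    ]


if __name__ == "__main__":
    for reply in asyncio.run(demo()):
        print(reply)
```

In this sketch the SMS message and the voice turns land in the same history, so the final voice reply carries the text sent by SMS, and only the voice path runs through the two hooks. A real deployment would replace the echo reply with an AI channel and the string inputs with STT output.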