    speech2speech_fullduplex

    Git Repo
    sankar-mukherjee

    Implements local full-duplex voice interaction with barge-in, supporting multiple English and Hindi personalities using Whisper, Ollama, and Kokoro.

    About speech2speech_fullduplex

    Piklu is an open-source framework designed for building local, full-duplex voice agents that support natural, real-time interactions including barge-in capabilities. By leveraging a local stack for speech-to-text, large language models, and text-to-speech, it enables developers to create conversational AI that can listen and speak simultaneously in both English and Hindi.

    For the Non-Technical Reader

    Imagine talking to a computer the same way you talk to a friend. Most AI voice assistants make you wait until they finish speaking before you can say something new. Piklu supports "barge-in": you can interrupt the AI mid-sentence, and it will stop and listen, just like a human would. Because it runs entirely on your own hardware, your voice data stays private, and the conversation feels fluid rather than robotic. It is essentially a "brain in a box" for your computer that can take on different personalities, from a helpful assistant to a specific character.

    For the Technical Reader

    The repository is built for low-latency, full-duplex communication using a fully local stack:

    • Streaming Infrastructure: Utilizes LiveKit for handling real-time audio streams between the browser and the backend.
    • STT (Speech-to-Text): Powered by Faster-Whisper for high-speed transcription.
    • LLM (Large Language Model): Integrates Llama-3-8B via llama-cpp-python, allowing for local inference with configurable speed and quality.
    • TTS (Text-to-Speech): Uses the Kokoro engine for high-quality, multilingual voice synthesis.
    • Barge-in Logic: Implements Acoustic Echo Cancellation (AEC) and fine-tuned Voice Activity Detection (VAD) to ensure the agent correctly identifies user speech even while it is playing back audio.
    • Deployment: Fully containerized with Docker, supporting NVIDIA GPU acceleration via the NVIDIA Container Toolkit.
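
    To make the barge-in idea concrete, here is a minimal sketch of the control logic only. It is not the repository's code: it swaps in a plain energy threshold for the AEC and VAD described above, and the names `frame_rms`, `BargeInDetector`, and `simulate`, along with the threshold values, are hypothetical and chosen for illustration. The agent is assumed to be speaking at the start, and playback is stopped once enough consecutive loud frames arrive from the microphone.

```python
import math
from dataclasses import dataclass, field
from typing import List, Sequence


def frame_rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one audio frame (samples in [-1.0, 1.0])."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


@dataclass
class BargeInDetector:
    """Hypothetical helper: flags a barge-in after enough consecutive loud frames.

    A real system would run echo cancellation first and use a trained VAD;
    the energy threshold here is a stand-in so the example stays self-contained.
    """
    energy_threshold: float = 0.1   # assumed value; tune per microphone
    min_speech_frames: int = 3      # consecutive loud frames required to trigger
    _run: int = field(default=0, init=False)

    def update(self, frame: Sequence[float], agent_speaking: bool) -> bool:
        # Count consecutive frames above the threshold; reset on a quiet frame.
        if frame_rms(frame) >= self.energy_threshold:
            self._run += 1
        else:
            self._run = 0
        # Only report a barge-in while the agent is actually playing audio.
        return agent_speaking and self._run >= self.min_speech_frames


def simulate(frames: Sequence[Sequence[float]], detector: BargeInDetector) -> List[str]:
    """Agent starts out speaking; playback stops on the first barge-in."""
    events: List[str] = []
    agent_speaking = True
    for i, frame in enumerate(frames):
        if detector.update(frame, agent_speaking):
            agent_speaking = False  # stand-in for cancelling TTS playback
            events.append(f"barge-in at frame {i}")
    return events


if __name__ == "__main__":
    quiet = [0.01] * 160
    loud = [0.5] * 160
    # One isolated loud frame is ignored; three in a row trigger the interrupt.
    stream = [quiet, loud, quiet, loud, loud, loud, quiet]
    print(simulate(stream, BargeInDetector()))  # ['barge-in at frame 5']
```

    Requiring several consecutive loud frames is one simple way to avoid reacting to a single click or cough. Without echo cancellation, though, the agent's own playback could cross the threshold, which is why the stack above pairs VAD with AEC.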

    Why It Matters

    This project is a step toward sovereign AI, where the full voice stack runs on hardware the user controls. While proprietary models offer impressive voice capabilities, they often come with high latency, subscription costs, and privacy concerns. By providing a blueprint for a full-duplex system built from open-source components like Whisper and Llama, this repository makes high-end voice interface technology more accessible. It shows that sophisticated features like echo cancellation and real-time interruption are no longer exclusive to big-tech cloud platforms.

    The Voice AI Space Lab Idea

    Using this framework, you could build a "Real-Time Multilingual Negotiator." Imagine a scenario where two people speaking different languages (English and Hindi) are trying to reach an agreement. The agent acts as a live, full-duplex mediator that not only translates but also uses its LLM "personality" to suggest compromises in real time. Because of the barge-in feature, users can correct the translation or add context immediately instead of waiting for the agent to finish speaking, which keeps a cross-lingual negotiation moving.

    Explore the repository here: https://github.com/sankar-mukherjee/speech2speech_fullduplex