speech2speech_fullduplex
Implements local full-duplex voice interaction with barge-in, supporting multiple English and Hindi personalities using Whisper, Ollama, and Kokoro.
About speech2speech_fullduplex
Piklu is an open-source framework designed for building local, full-duplex voice agents that support natural, real-time interactions including barge-in capabilities. By leveraging a local stack for speech-to-text, large language models, and text-to-speech, it enables developers to create conversational AI that can listen and speak simultaneously in both English and Hindi.
For the Non-Technical Reader
Imagine talking to a computer the same way you talk to a friend. Most AI voice assistants make you wait until they finish speaking before you can say something new. Piklu changes this by allowing "barge-in": you can interrupt the AI mid-sentence, and it will stop and listen, much as a person would. Because it runs entirely on your own hardware, your voice data stays on your machine, and the conversation feels fluid rather than robotic. It is essentially a "brain in a box" for your computer that can take on different personalities, from a helpful assistant to a specific character.
For the Technical Reader
The repository is built for low-latency, full-duplex communication on a fully local stack:
- Streaming Infrastructure: Utilizes LiveKit for handling real-time audio streams between the browser and the backend.
- STT (Speech-to-Text): Powered by Faster-Whisper for high-speed transcription.
- LLM (Large Language Model): Runs a local model such as Llama-3-8B through Ollama, with a configurable trade-off between speed and quality.
- TTS (Text-to-Speech): Uses the Kokoro engine for high-quality, multi-lingual voice synthesis.
- Barge-in Logic: Implements Acoustic Echo Cancellation (AEC) and fine-tuned Voice Activity Detection (VAD) to ensure the agent correctly identifies user speech even while it is playing back audio.
- Deployment: Fully containerized with Docker, supporting NVIDIA GPU acceleration via the NVIDIA Container Toolkit.
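The barge-in behavior described in the list above can be illustrated with a small, self-contained sketch. This is not the repository's code: `BargeInDetector`, `simulate`, the RMS threshold, and the frame counts are all hypothetical names and values chosen for illustration, and a simple energy check stands in for a real VAD model. It also assumes the microphone frames have already passed through echo cancellation, so the agent's own playback does not trigger the detector.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one audio frame (samples in [-1.0, 1.0])."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


@dataclass
class BargeInDetector:
    """Hypothetical energy-based stand-in for a real VAD model."""
    threshold: float = 0.05     # assumed RMS level that counts as speech
    min_speech_frames: int = 3  # consecutive speech frames needed to trigger
    _streak: int = 0            # current run of consecutive speech frames

    def update(self, mic_frame: Sequence[float], agent_speaking: bool) -> bool:
        """Return True when the agent should stop talking and listen."""
        if rms(mic_frame) >= self.threshold:
            self._streak += 1
        else:
            self._streak = 0  # a quiet frame resets the run
        return agent_speaking and self._streak >= self.min_speech_frames


def simulate(
    mic_frames: Sequence[Sequence[float]],
    tts_chunks: Sequence[str],
    detector: BargeInDetector,
) -> Tuple[List[str], bool]:
    """Play one TTS chunk per mic frame and cancel playback on barge-in.

    Returns the chunks that were actually played and whether the user
    interrupted the agent.
    """
    played: List[str] = []
    for mic_frame, chunk in zip(mic_frames, tts_chunks):
        if detector.update(mic_frame, agent_speaking=True):
            return played, True  # drop the remaining chunks
        played.append(chunk)
    return played, False


if __name__ == "__main__":
    silence = [0.0] * 160  # one quiet frame
    speech = [0.2] * 160   # one loud frame, standing in for user speech
    mic = [silence, silence, speech, speech, speech, silence]
    chunks = ["Hello,", "I", "can", "help", "you", "today."]
    print(simulate(mic, chunks, BargeInDetector()))
```

Requiring several consecutive speech frames before cancelling playback is one common way to avoid reacting to a cough or a click; a real system would tune this against the latency it adds to each interruption.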
Why It Matters
This project reflects a broader move toward sovereign AI. Proprietary voice models offer impressive capabilities, but they can bring network latency, subscription costs, and privacy concerns. By providing a blueprint for a full-duplex system built from open-source components like Whisper and Llama, this repository makes advanced voice interface technology more widely accessible. It shows that features like echo cancellation and real-time interruption are no longer exclusive to big-tech cloud platforms.
The Voice AI Space Lab Idea
Using this framework, you could build a "Real-Time Multilingual Negotiator." Imagine two people who speak different languages (English and Hindi) trying to reach an agreement. The agent acts as a live, full-duplex mediator that translates and also uses its LLM "personality" to suggest compromises in real time. Because of the barge-in feature, users can correct a translation or add context immediately, without waiting for the agent to finish speaking, which keeps the cross-lingual negotiation flowing.
Explore the repository here: https://github.com/sankar-mukherjee/speech2speech_fullduplex