Interview
Sylvain Boily
Founder
RoomKit
How did you end up in voice?
It started in a basement in Puteaux, France, in 2004. Two friends and I founded XiVO, an open-source PBX built on Asterisk, and I spent the next 20 years in telecom. But even early on, I was frustrated with how voice was handled. IVRs were terrible. Call routing was rigid. Conversations felt like obstacles, not experiences. I always felt we could do better, but the technology wasn't ready. Around 2018, I built my first voice AI proof of concept during a hackathon. The potential was there, but I was still thinking like a telecom guy: voice as infrastructure, not as a natural interface. The real shift happened in late 2023. Voice AI wasn't just a telecom upgrade; it was a paradigm change. That pushed me to build Angany from scratch. Not a pivot. Just 20 years of frustration finally meeting the right technology.
What's your struggle or moment of joy with voice?
My biggest struggle came early, deploying my first real client with Swisscom phones that simply didn't work. We had to replace every single handset. The client had no working telephony for a period. That kind of failure stays with you. My moment of joy came in 2018, at a hackathon with two teammates. We built a module to transcribe voice to text in a browser, in real time. Seeing it work, actually work, was electric. The technology was clearly moving somewhere. But honestly? 2026 is the most exciting year of my 30 years in software development. The shift is brutal and fast, but I love it. Voice is making a comeback as what it always was: our most natural way to communicate. The possibilities feel infinite. I wake up every morning genuinely excited about what we're building.
Where do you think voice is going?
Speech-to-speech is a revolution. No more latency artifacts from the classic STT → LLM → TTS pipeline, just fluid, natural conversation. It also unlocks scenarios that are nearly impossible with the traditional approach. Think multi-speaker group conversations: an absolute nightmare with a pipeline, but speech-to-speech handles them naturally. That said, both architectures have their place. The pipeline gives you control and inspectability. Speech-to-speech gives you fluidity. The best systems will use both. The same logic applies to local vs. cloud. I've built fully local voice pipelines on a single GPU with sub-300ms response times. Privacy, latency, and connectivity make edge AI essential in many contexts. But cloud has its strengths too. It's not about choosing sides. And voice combined with MCP and tool calling? That's where it gets really exciting: agents that don't just talk but actually act. I think this combination is going to be transformative.