WhisperSubs
Provides real-time speech transcription and translation in browsers using OpenAI Realtime API with floating subtitle overlays and SRT exports.
About WhisperSubs
WhisperSubs: Real-Time Subtitles for the Modern Web
WhisperSubs is a browser-based application that uses OpenAI's Realtime API to transcribe and translate speech in near real time. It captures audio directly from a browser tab or a microphone and displays live subtitles with sub-second latency, so viewers can follow speech in a language they do not speak.
1. For the Non-Technical Reader
Imagine you are watching a live international conference or a YouTube video in a language you don't speak fluently. WhisperSubs acts like a digital interpreter that sits right on your screen. It creates a floating subtitle window that you can drag over any video player, showing you what is being said as it happens. Beyond just translating into 30+ languages, it includes a Live Chat feature where you can actually ask questions about the conversation, such as "What was the main point of the last five minutes?", and get an AI-generated summary instantly.
2. For the Technical Reader
The architecture of WhisperSubs is built to minimize the "latency tax" typically found in standard Whisper-based pipelines. Key technical specifications include:
- Engine: Powered by the OpenAI Realtime API for word-by-word streaming.
- Audio Capture: Captures tab audio in Chromium-based browsers and also accepts standard microphone input.
- Translation & Logic: Employs GPT-4o-mini for high-speed, low-cost translation and live session summarization.
- VAD Controls: Advanced Voice Activity Detection (VAD) settings allow developers to tune speech sensitivity and silence duration (ms) to optimize turn-taking.
- Output: Supports SRT export for timestamped subtitle files and a detachable popup window for the UI overlay.
- Requirements: Python 3.11+ and an OpenAI API key with Realtime API access.
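As a rough illustration of the VAD and SRT bullets above, here is a minimal Python sketch. It is not taken from the WhisperSubs repository: the `Segment` class and the `srt_timestamp`, `to_srt`, and `vad_session_update` helpers are hypothetical names introduced for this example. The VAD payload assumes the `server_vad` turn-detection fields (`threshold`, `prefix_padding_ms`, `silence_duration_ms`) as documented for the OpenAI Realtime API `session.update` event; field placement has varied between API versions, so check the current reference before relying on it.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One subtitle cue (hypothetical structure, not from the repository)."""
    start: float  # seconds from session start
    end: float    # seconds from session start
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as the SRT timestamp HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments: list[Segment]) -> str:
    """Render cues as SRT: a 1-based index, a time range, the text, a blank line."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        time_range = f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}"
        blocks.append(f"{index}\n{time_range}\n{seg.text.strip()}\n")
    return "\n".join(blocks)


def vad_session_update(
    threshold: float = 0.5,
    silence_duration_ms: int = 500,
    prefix_padding_ms: int = 300,
) -> dict:
    """Build a session.update event that tunes server-side VAD.

    Assumption: field names follow the Realtime API's server_vad settings;
    the default values here are illustrative, not WhisperSubs defaults.
    """
    return {
        "type": "session.update",
        "session": {
            "turn_detection": {
                "type": "server_vad",
                "threshold": threshold,  # higher = less sensitive to quiet speech
                "prefix_padding_ms": prefix_padding_ms,  # audio kept before speech starts
                "silence_duration_ms": silence_duration_ms,  # silence that ends a turn
            }
        },
    }
```

For example, `to_srt([Segment(0.0, 1.5, "Hello")])` produces a single numbered cue spanning `00:00:00,000 --> 00:00:01,500`. A shorter `silence_duration_ms` should close turns sooner, which lowers subtitle delay but risks splitting sentences at natural pauses.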
3. Why It Matters
WhisperSubs reflects a shift from batch transcription to real-time interaction. In the Voice AI industry, sub-second latency is critical for accessibility and global collaboration, because subtitles that lag behind the speaker are hard to follow in a live setting. The repository is released under the MIT License, so developers can use it as a blueprint for their own low-latency translation tools. It also avoids the need to manage local GPU clusters, relying instead on hosted API endpoints to handle the heavy lifting of speech-to-text (STT).
4. The Voice AI Space Lab Idea
Why not build a "Global Classroom Assistant"? Using WhisperSubs as a foundation, a developer could create a tool for international students that not only provides live translations of a lecture but also automatically generates a real-time glossary of technical terms mentioned by the professor. As the lecture progresses, the AI could flag complex concepts in the chat sidebar, providing Wikipedia links or simplified explanations without the student ever having to leave the video tab.
You can explore the repository here: https://github.com/stzifkas/WhisperSubs