Month 5 Week 4: Five Month Reflection - From Idea to Real-Time Communication
November 23, 2025
Five months ago, there was an idea. Now there's a system that can have real-time conversations with users through audio. This post reflects on what that journey looked like—the decision points, the dead ends, the breakthroughs, and what it took to go from concept to a working, optimized product.
The Initial Vision
The starting point was clear in intent but vague in execution: build an AI system that can communicate with users in real-time through voice. Not text-based. Not batched. Real-time—where the AI listens, understands, and responds naturally within the flow of conversation.
That's straightforward to state. The actual implementation required navigating several layers of complexity that weren't obvious at the start.
The Design Phase: Laying Out the Architecture
Early on, we created design documents that outlined the pipeline:
- Audio Input → Capture user voice
- Voice Activity Detection (VAD) → Determine when the user is speaking
- Speech-to-Text → Transcribe audio with Whisper
- Language Model → Generate a contextual response
- Text-to-Speech → Convert response back to audio
- Audio Output → Play to user
This looks logical on paper. The challenge was making each stage fast enough that the cumulative latency felt natural. A user won't tolerate 5 seconds of silence after speaking before hearing the first response.
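To make the cumulative-latency problem concrete, here's a minimal sketch of why a fully sequential pipeline fails the conversational test. The per-stage timings are illustrative placeholders, not measurements from our system:

```python
# Illustrative per-stage latencies for one conversational turn, in milliseconds.
# These are hypothetical numbers for demonstration, not measured values.
STAGE_LATENCY_MS = {
    "vad": 30,               # detect end of user speech
    "speech_to_text": 800,   # Whisper transcription
    "llm": 1200,             # generate the full response
    "text_to_speech": 400,   # synthesize audio
    "playback_start": 20,    # buffer and begin output
}

def sequential_latency_ms(stages: dict) -> int:
    """Time to first audible response when every stage waits for the previous one."""
    return sum(stages.values())

print(sequential_latency_ms(STAGE_LATENCY_MS))  # 2450
```

Even with each stage individually "fast," the strictly sequential sum lands well past the point where a pause stops feeling conversational — which is exactly the wall described below.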
Wireframing the System
Before writing code, we mapped out:
- How audio would flow through the system
- Where bottlenecks might appear
- What components needed to talk to each other
- Where we could parallelize processing
The Stasis application emerged as the orchestrator—the piece that would manage audio streams, coordinate with the GPU cluster, and keep everything in sync. This was critical because without a clear architectural center, the system would have been a tangle of interconnected services with timing issues.
The Coding Phase: Building the Pipeline
Implementation started with the fundamentals:
- Set up a Kubernetes cluster to run the AI models
- Created a Python service to handle Whisper inference
- Built the Stasis application to orchestrate audio flow
- Integrated the LLM for response generation
- Added text-to-speech synthesis
Early versions were slow. Whisper ran on CPU. The LLM waited for Whisper to finish. Everything was sequential. A single round-trip took seconds—too long for conversation.
The Performance Wall
The first major realization came when we tested at scale: the system worked, but it was too slow. Time-to-first-response was measured in seconds, not milliseconds. This wasn't a minor optimization—it was an architectural problem.
Whisper on CPU Was the First Bottleneck
Whisper, OpenAI's speech recognition model, is excellent but compute-intensive. Running it on CPU made sense for prototyping, but the latency was prohibitive. This led to the first major decision: move to GPU.
The GPU Cluster Transition
Moving the Kubernetes cluster to GPU compute changed everything. Transcriptions that took 3-5 seconds on CPU dropped to roughly 500ms on GPU. This was progress, but not enough on its own.
The Contention Problem
With both Whisper and the LLM running on a single GPU, we hit another wall. Both models need memory and compute. Under load, they'd fight for resources, causing unpredictable latency spikes. The solution required adding a second GPU and splitting the workload—Whisper on one, LLM on the other.
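The post doesn't specify the exact mechanism used to split the workload, but one common approach is to pin each inference service to its own device via `CUDA_VISIBLE_DEVICES`, so each process sees exactly one GPU and can't contend with its neighbor. A hypothetical sketch:

```python
import os

# Hypothetical service-to-GPU assignment mirroring the split described above:
# Whisper on one device, the LLM on the other. Each process launched with
# this environment sees only its assigned GPU (exposed to it as device 0).
SERVICE_GPU = {
    "whisper-inference": "0",  # GPU 1: speech-to-text
    "llm-inference": "1",      # GPU 2: response generation
}

def launch_env(service: str) -> dict:
    """Environment for a service process, restricted to its assigned GPU."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = SERVICE_GPU[service]
    return env

# e.g. pass env=launch_env("whisper-inference") to subprocess.Popen
print(launch_env("llm-inference")["CUDA_VISIBLE_DEVICES"])  # 1
```

In a Kubernetes setting the same isolation is usually expressed as per-pod GPU resource limits, but the principle is identical: each model gets a device it never has to share.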
Navigating VAD, Whisper, and LLM
Each component brought its own quirks:
Voice Activity Detection (VAD)
VAD needs to be sensitive enough to catch speech but not so sensitive that background noise triggers it. Getting this right required tuning thresholds and understanding what "silence" means in different contexts. Too aggressive and we'd cut off the user's speech prematurely. Too lenient and we'd include trailing noise.
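A minimal energy-threshold VAD illustrates the tuning trade-off (this is a simplified stand-in — our actual VAD implementation isn't detailed here, and production systems often use trained detectors instead):

```python
import math

def frame_rms(samples):
    """Root-mean-square energy of one audio frame."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def detect_speech(frames, threshold=0.02, hangover=3):
    """Flag each frame as speech when its energy exceeds `threshold`.

    `hangover` keeps speech "on" through a few quiet frames so brief pauses
    between words don't end the utterance. Raising it makes the VAD more
    lenient (risking included noise); lowering it makes it more aggressive
    (risking cutting the speaker off mid-sentence).
    """
    flags, quiet = [], hangover
    for frame in frames:
        if frame_rms(frame) >= threshold:
            quiet = 0
        else:
            quiet += 1
        flags.append(quiet <= hangover)
    return flags

silent, loud = [0.001] * 160, [0.1] * 160
print(detect_speech([silent, loud, silent, loud, silent]))
```

The two knobs — `threshold` and `hangover` — are exactly the "too aggressive vs. too lenient" axis described above.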
Whisper: Balancing Accuracy and Speed
Whisper has multiple model sizes. The larger the model, the more accurate but slower the transcription. We settled on a middle ground that gives us accuracy without excessive latency. The tradeoff was intentional—we prioritized speed for real-time response over transcription perfection.
The LLM and Response Generation
The LLM generates tokens one at a time. To feel responsive, we needed to start playing back audio from the first token, not wait for the full response. This meant:
- Streaming token output rather than batch generation
- Using TTS to convert tokens incrementally to audio
- Managing sentence boundaries so responses don't feel fragmented
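The third point — sentence boundaries — can be sketched as a small chunker that buffers streamed tokens and releases a chunk to TTS only at end-of-sentence punctuation. This is a simplified stand-in for whatever segmentation the real pipeline uses:

```python
SENTENCE_END = {".", "!", "?"}

def sentence_chunks(token_stream):
    """Group streamed LLM tokens into sentence-sized chunks for incremental TTS.

    Yielding at sentence boundaries keeps the synthesized audio from sounding
    fragmented, while still letting playback start long before the full
    response has been generated.
    """
    buffer = []
    for token in token_stream:
        buffer.append(token)
        if token.strip() and token.strip()[-1] in SENTENCE_END:
            yield "".join(buffer).strip()
            buffer = []
    if buffer:  # flush any trailing partial sentence
        yield "".join(buffer).strip()

tokens = ["Sure", ",", " I", " can", " help", ".", " What", " do", " you", " need", "?"]
print(list(sentence_chunks(tokens)))
```

The first chunk is ready for TTS as soon as the first sentence completes, which is what makes the response feel immediate.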
The Optimization Phase: Achieving Real-Time
The breakthrough came from understanding that real-time isn't about making one component faster—it's about overlapping processes. While Whisper is processing the user's speech, the LLM should be generating the response. While the LLM is generating, TTS should be converting earlier tokens to audio.
The dual-GPU architecture enabled this:
- GPU 1 (Whisper): Always ready for the next chunk of audio
- GPU 2 (LLM): Generating response tokens in parallel
- Streaming Pipeline: Each component feeds its output to the next immediately, no waiting
This reduced end-to-end latency from seconds to hundreds of milliseconds.
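The overlap described above can be sketched with threads and queues: each stage forwards its output the moment it's ready instead of waiting for the previous stage to finish an entire turn. The stage bodies here are stubs standing in for Whisper, the LLM, and TTS:

```python
import queue
import threading

def stage(transform, inbox, outbox):
    """Run one pipeline stage: consume items, forward results immediately."""
    while (item := inbox.get()) is not None:
        outbox.put(transform(item))
    outbox.put(None)  # propagate shutdown downstream

# Queues connect the stages; each stage runs concurrently in its own thread.
audio_q, text_q, token_q, speech_q = (queue.Queue() for _ in range(4))

threads = [
    threading.Thread(target=stage, args=(lambda a: f"transcript({a})", audio_q, text_q)),
    threading.Thread(target=stage, args=(lambda t: f"reply({t})", text_q, token_q)),
    threading.Thread(target=stage, args=(lambda r: f"audio({r})", token_q, speech_q)),
]
for t in threads:
    t.start()

for chunk in ["chunk1", "chunk2"]:  # audio streams in chunk by chunk
    audio_q.put(chunk)
audio_q.put(None)

while (out := speech_q.get()) is not None:
    print(out)
for t in threads:
    t.join()
```

With real models in the stage bodies, GPU 1 is already transcribing the next audio chunk while GPU 2 generates tokens for the previous one — the end-to-end time approaches the slowest single stage rather than the sum of all stages.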
The Deployment Reality
Building it locally is one thing. Deploying it meant:
- Managing containerized services reliably
- Ensuring GPU resources were allocated correctly
- Handling real-world audio quality issues
- Debugging timing issues at scale
Each deployment revealed new edge cases that looked fine in testing but broke under real usage patterns.
What This Five-Month Journey Revealed
- Architecture matters more than optimization. We couldn't optimize our way out of a sequential pipeline. Adding a second GPU was more valuable than micro-optimizing individual components.
- Real-time is a systems problem. It's not about one fast component—it's about how components interact, when they communicate, and how you eliminate wait times.
- Latency is perceptual. 100ms feels fast. 500ms feels natural. 2 seconds feels broken. The metrics that matter are the ones users feel.
- Trade-offs are everywhere. Accuracy vs. speed. Feature completeness vs. real-time response. Single GPU simplicity vs. multi-GPU complexity. Every choice cascaded.
- Production forces clarity. Prototype code is forgiving. Real systems expose hidden assumptions. We discovered issues that theoretical analysis missed.
Where We Are Now
Five months in, we have a system that can handle real-time voice conversations. Audio comes in, the system processes it through VAD, Whisper, and the LLM, and responds with synthesized speech—all with latency that feels natural.
Is it perfect? No. There are still optimizations to make, edge cases to handle, and performance headroom to claim. But the core—real-time, conversational AI through audio—works.
The Next Phase
The foundation is solid. From here, the work shifts to refinement: improving sentence formation, handling longer context, reducing latency further, and deploying at scale.
But the hard architectural work is done. We navigated VAD, Whisper, the LLM, and GPU resources to build something that felt impossible five months ago.
That's worth reflecting on.