Month 5 Week 4: Five Month Reflection - From Idea to Real-Time Communication
November 23, 2025
Five months ago, there was an idea. Now there's a system that can have real-time conversations with users through audio. This post reflects on what that journey looked like—the decision points, the dead ends, the breakthroughs, and what it took to go from concept to a working, optimized product.
The Initial Vision
The starting point was clear in intent but vague in execution: build an AI system that can communicate with users in real-time through voice. Not text-based. Not batched. Real-time—where the AI listens, understands, and responds naturally within the flow of conversation.
That's straightforward to state. The actual implementation required navigating several layers of complexity that weren't obvious at the start.
The Design Phase: Laying Out the Architecture
Early on, we created design documents that outlined the pipeline:
- Audio Input → Capture user voice
- Voice Activity Detection (VAD) → Determine when the user is speaking
- Speech-to-Text → Transcribe audio with Whisper
- Language Model → Generate a contextual response
- Text-to-Speech → Convert response back to audio
- Audio Output → Play to user
This looks logical on paper. The challenge was making each stage fast enough that the cumulative latency felt natural. A user won't tolerate 5 seconds of silence after speaking before hearing the first response.
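To make the cumulative-latency problem concrete, here's a minimal sketch of why a fully sequential pipeline fails the conversational test. The per-stage timings are illustrative placeholders, not measurements from our system:

```python
# Illustrative per-stage latencies for one conversational turn, in milliseconds.
# These are hypothetical numbers for demonstration, not measured values.
STAGE_LATENCY_MS = {
    "vad": 30,               # detect end of user speech
    "speech_to_text": 800,   # Whisper transcription
    "llm": 1200,             # generate the full response
    "text_to_speech": 400,   # synthesize audio
    "playback_start": 20,    # buffer and begin output
}

def sequential_latency_ms(stages: dict) -> int:
    """Time to first audible response when every stage waits for the previous one."""
    return sum(stages.values())

print(sequential_latency_ms(STAGE_LATENCY_MS))  # 2450
```

Even with each stage individually "fast," the strictly sequential sum lands well past the point where a pause stops feeling conversational — which is exactly the wall described below.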
Wireframing the System
Before writing code, we mapped out:
- How audio would flow through the system
- Where bottlenecks might appear
- What components needed to talk to each other
- Where we could parallelize processing
The Stasis application emerged as the orchestrator—the piece that would manage audio streams, coordinate with the GPU cluster, and keep everything in sync. This was critical because without a clear architectural center, the system would have been a tangle of interconnected services with timing issues.
The Coding Phase: Building the Pipeline
Implementation started with the fundamentals:
- Set up a Kubernetes cluster to run the AI models
- Created a Python service to handle Whisper inference
- Built the Stasis application to orchestrate audio flow
- Integrated the LLM for response generation
- Added text-to-speech synthesis
Early versions were slow. Whisper ran on CPU. The LLM waited for Whisper to finish. Everything was sequential. A single round-trip took seconds—too long for conversation.
The Performance Wall
The first major realization came when we tested at scale: the system worked, but it was too slow. Time-to-first-response was measured in seconds, not milliseconds. This wasn't a minor optimization—it was an architectural problem.
Whisper on CPU Was the First Bottleneck
Whisper, OpenAI's speech recognition model, is excellent but compute-intensive. Running it on CPU made sense for prototyping, but the latency was prohibitive. This led to the first major decision: move to GPU.
The GPU Cluster Transition
Moving the Kubernetes cluster to GPU compute changed everything. Transcriptions that took 3-5 seconds on CPU dropped to roughly 500ms on GPU. This was progress, but not enough on its own.
The Contention Problem
With both Whisper and the LLM running on a single GPU, we hit another wall. Both models need memory and compute. Under load, they'd fight for resources, causing unpredictable latency spikes. The solution required adding a second GPU and splitting the workload—Whisper on one, LLM on the other.
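The post doesn't specify the exact mechanism used to split the workload, but one common approach is to pin each inference service to its own device via `CUDA_VISIBLE_DEVICES`, so each process sees exactly one GPU and can't contend with its neighbor. A hypothetical sketch:

```python
import os

# Hypothetical service-to-GPU assignment mirroring the split described above:
# Whisper on one device, the LLM on the other. Each process launched with
# this environment sees only its assigned GPU (exposed to it as device 0).
SERVICE_GPU = {
    "whisper-inference": "0",  # GPU 1: speech-to-text
    "llm-inference": "1",      # GPU 2: response generation
}

def launch_env(service: str) -> dict:
    """Environment for a service process, restricted to its assigned GPU."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = SERVICE_GPU[service]
    return env

# e.g. pass env=launch_env("whisper-inference") to subprocess.Popen
print(launch_env("llm-inference")["CUDA_VISIBLE_DEVICES"])  # 1
```

In a Kubernetes setting the same isolation is usually expressed as per-pod GPU resource limits, but the principle is identical: each model gets a device it never has to share.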
Navigating VAD, Whisper, and LLM
Each component brought its own quirks:
Voice Activity Detection (VAD)
VAD needs to be sensitive enough to catch speech but not so sensitive that background noise triggers it. Getting this right required tuning thresholds and understanding what "silence" means in different contexts. Too aggressive and we'd cut off the user's speech prematurely. Too lenient and we'd include trailing noise.
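A minimal energy-threshold VAD illustrates the tuning trade-off (this is a simplified stand-in — our actual VAD implementation isn't detailed here, and production systems often use trained detectors instead):

```python
import math

def frame_rms(samples):
    """Root-mean-square energy of one audio frame."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def detect_speech(frames, threshold=0.02, hangover=3):
    """Flag each frame as speech when its energy exceeds `threshold`.

    `hangover` keeps speech "on" through a few quiet frames so brief pauses
    between words don't end the utterance. Raising it makes the VAD more
    lenient (risking included noise); lowering it makes it more aggressive
    (risking cutting the speaker off mid-sentence).
    """
    flags, quiet = [], hangover
    for frame in frames:
        if frame_rms(frame) >= threshold:
            quiet = 0
        else:
            quiet += 1
        flags.append(quiet <= hangover)
    return flags

silent, loud = [0.001] * 160, [0.1] * 160
print(detect_speech([silent, loud, silent, loud, silent]))
```

The two knobs — `threshold` and `hangover` — are exactly the "too aggressive vs. too lenient" axis described above.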
Whisper: Balancing Accuracy and Speed
Whisper has multiple model sizes. The larger the model, the more accurate but slower the transcription. We settled on a middle ground that gives us accuracy without excessive latency. The tradeoff was intentional—we prioritized speed for real-time response over transcription perfection.
The LLM and Response Generation
The LLM generates tokens one at a time. To feel responsive, we needed to start playing back audio from the first token, not wait for the full response. This meant:
- Streaming token output rather than batch generation
- Using TTS to convert tokens incrementally to audio
- Managing sentence boundaries so responses don't feel fragmented
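The third point — sentence boundaries — can be sketched as a small chunker that buffers streamed tokens and releases a chunk to TTS only at end-of-sentence punctuation. This is a simplified stand-in for whatever segmentation the real pipeline uses:

```python
SENTENCE_END = {".", "!", "?"}

def sentence_chunks(token_stream):
    """Group streamed LLM tokens into sentence-sized chunks for incremental TTS.

    Yielding at sentence boundaries keeps the synthesized audio from sounding
    fragmented, while still letting playback start long before the full
    response has been generated.
    """
    buffer = []
    for token in token_stream:
        buffer.append(token)
        if token.strip() and token.strip()[-1] in SENTENCE_END:
            yield "".join(buffer).strip()
            buffer = []
    if buffer:  # flush any trailing partial sentence
        yield "".join(buffer).strip()

tokens = ["Sure", ",", " I", " can", " help", ".", " What", " do", " you", " need", "?"]
print(list(sentence_chunks(tokens)))
```

The first chunk is ready for TTS as soon as the first sentence completes, which is what makes the response feel immediate.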
The Optimization Phase: Achieving Real-Time
The breakthrough came from understanding that real-time isn't about making one component faster—it's about overlapping processes. While Whisper is processing the user's speech, the LLM should be generating the response. While the LLM is generating, TTS should be converting earlier tokens to audio.
The dual-GPU architecture enabled this:
- GPU 1 (Whisper): Always ready for the next chunk of audio
- GPU 2 (LLM): Generating response tokens in parallel
- Streaming Pipeline: Each component feeds its output to the next immediately, no waiting
This reduced end-to-end latency from seconds to hundreds of milliseconds.
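The overlap described above can be sketched with threads and queues: each stage forwards its output the moment it's ready instead of waiting for the previous stage to finish an entire turn. The stage bodies here are stubs standing in for Whisper, the LLM, and TTS:

```python
import queue
import threading

def stage(transform, inbox, outbox):
    """Run one pipeline stage: consume items, forward results immediately."""
    while (item := inbox.get()) is not None:
        outbox.put(transform(item))
    outbox.put(None)  # propagate shutdown downstream

# Queues connect the stages; each stage runs concurrently in its own thread.
audio_q, text_q, token_q, speech_q = (queue.Queue() for _ in range(4))

threads = [
    threading.Thread(target=stage, args=(lambda a: f"transcript({a})", audio_q, text_q)),
    threading.Thread(target=stage, args=(lambda t: f"reply({t})", text_q, token_q)),
    threading.Thread(target=stage, args=(lambda r: f"audio({r})", token_q, speech_q)),
]
for t in threads:
    t.start()

for chunk in ["chunk1", "chunk2"]:  # audio streams in chunk by chunk
    audio_q.put(chunk)
audio_q.put(None)

while (out := speech_q.get()) is not None:
    print(out)
for t in threads:
    t.join()
```

With real models in the stage bodies, GPU 1 is already transcribing the next audio chunk while GPU 2 generates tokens for the previous one — the end-to-end time approaches the slowest single stage rather than the sum of all stages.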
The Deployment Reality
Building it locally is one thing. Deploying it meant:
- Managing containerized services reliably
- Ensuring GPU resources were allocated correctly
- Handling real-world audio quality issues
- Debugging timing issues at scale
Each deployment revealed new edge cases that looked fine in testing but broke under real usage patterns.
What This Five-Month Journey Revealed
- Architecture matters more than optimization. We couldn't optimize our way out of a sequential pipeline. Adding a second GPU was more valuable than micro-optimizing individual components.
- Real-time is a systems problem. It's not about one fast component—it's about how components interact, when they communicate, and how you eliminate wait times.
- Latency is perceptual. 100ms feels fast. 500ms feels natural. 2 seconds feels broken. The metrics that matter are the ones users feel.
- Trade-offs are everywhere. Accuracy vs. speed. Feature completeness vs. real-time response. Single GPU simplicity vs. multi-GPU complexity. Every choice cascaded.
- Production forces clarity. Prototype code is forgiving. Real systems expose hidden assumptions. We discovered issues that theoretical analysis missed.
Where We Are Now
Five months in, we have a system that can handle real-time voice conversations. Audio comes in, the system processes it through VAD, Whisper, and the LLM, and responds with synthesized speech—all with latency that feels natural.
Is it perfect? No. There are still optimizations to make, edge cases to handle, and performance headroom to claim. But the core—real-time, conversational AI through audio—works.
The Next Phase
The foundation is solid. From here, the work shifts to refinement: improving sentence formation, handling longer context, reducing latency further, and deploying at scale.
But the hard architectural work is done. We navigated VAD, Whisper, the LLM, and GPU resources to build something that felt impossible five months ago.
That's worth reflecting on.