Month 5 Week 3: GPU Migration and Real-Time Performance Optimization
November 16, 2025
The core challenge this week was performance at scale. We had Whisper running on the CPU-based Kubernetes cluster, and it was becoming a bottleneck. Real-time conversation requires every component to be fast, from capturing audio to generating the first token. This week was about moving pieces to GPU and making the architectural changes needed to hit real-time latency targets.
The Whisper CPU Problem
Whisper running on CPU was viable for prototyping, but not for production real-time communication. The latency was noticeable—users would speak, and there'd be a delay waiting for transcription. When you're trying to maintain natural conversation flow, every millisecond matters. The CPU cluster was saturating under even moderate loads.
Migrating to GPU Kubernetes Cluster
Moving Whisper to the GPU cluster was the first step. This alone gave us significant speedups: Whisper is optimized for GPU execution, and the difference was immediate. Transcription that took seconds on CPU now takes a fraction of a second on GPU. But this wasn't a plug-and-play migration. It involved:
- Reworking the pipeline to route audio through the GPU cluster
- Managing resource allocation so Whisper had the compute it needed
- Handling the orchestration between the Stasis application and the GPU cluster
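The routing change can be sketched as a small dispatch function. The service URLs and the fallback logic below are illustrative, not our actual deployment names: the point is that audio now targets a dedicated GPU Whisper service, with the old CPU deployment kept as a fallback.

```python
# Hypothetical in-cluster service DNS names; the real names differ.
WHISPER_GPU_URL = "http://whisper-gpu.transcribe.svc.cluster.local:9000"
WHISPER_CPU_URL = "http://whisper-cpu.transcribe.svc.cluster.local:9000"

def pick_transcription_endpoint(gpu_available: bool) -> str:
    """Route audio chunks to the GPU deployment, falling back to CPU
    if the GPU service is unavailable."""
    return WHISPER_GPU_URL if gpu_available else WHISPER_CPU_URL
```

In practice the availability check comes from the orchestration layer (e.g. readiness probes), so the Stasis application never has to know which cluster actually served the request.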
Dual GPU Architecture for Parallel Processing
One GPU running both Whisper and LLM inference created contention. Both models compete for memory and compute, and you hit saturation quickly. The solution was adding a second GPU and splitting the workload:
- GPU 1: Whisper (speech-to-text) - isolated and optimized for audio processing
- GPU 2: LLM inference - handling the conversational response generation
This architectural split was crucial. Instead of one bottleneck, we now have parallel processing. Audio comes in, Whisper transcribes it on GPU 1, the LLM generates a response on GPU 2, and we feed it back to the user. No waiting for one model to finish before the other starts.
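The split can be illustrated as a producer/consumer pipeline with one worker per GPU. The `transcribe()` and `generate()` stubs below stand in for Whisper pinned to the first GPU and the LLM pinned to the second; everything else is a minimal sketch of the queue wiring, not our production code.

```python
import queue
import threading

audio_q: "queue.Queue" = queue.Queue()   # raw audio chunks in
text_q: "queue.Queue" = queue.Queue()    # transcripts between the GPUs
reply_q: "queue.Queue" = queue.Queue()   # generated responses out

def transcribe(chunk: str) -> str:
    # Stand-in for Whisper running on GPU 1 (e.g. device "cuda:0").
    return f"transcript({chunk})"

def generate(text: str) -> str:
    # Stand-in for LLM inference on GPU 2 (e.g. device "cuda:1").
    return f"reply({text})"

def whisper_worker() -> None:
    # Drains audio chunks until a None sentinel, then passes it on.
    while (chunk := audio_q.get()) is not None:
        text_q.put(transcribe(chunk))
    text_q.put(None)

def llm_worker() -> None:
    while (text := text_q.get()) is not None:
        reply_q.put(generate(text))

threads = [threading.Thread(target=whisper_worker),
           threading.Thread(target=llm_worker)]
for t in threads:
    t.start()
audio_q.put("chunk-1")
audio_q.put(None)          # signal end of audio
for t in threads:
    t.join()
result = reply_q.get()     # "reply(transcript(chunk-1))"
```

Because each worker owns its own GPU, Whisper can start on the next audio chunk while the LLM is still generating a response to the previous one.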
Time-to-First-Response: The Real Metric
What matters in real-time communication is time-to-first-response—how fast do we produce the first token of output after the user finishes speaking? This is what feels natural. With the dual-GPU setup:
- Whisper completes transcription in under 500ms for typical audio chunks
- LLM starts generating the first response token within 100-200ms of receiving the transcription
- The full pipeline from audio capture to voice output is approaching real-time viability
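Measuring this metric is straightforward: start the clock when the user stops speaking and stop it at the first streamed token. The `stream_tokens()` generator below is a stand-in for the real streaming LLM call; only the timing pattern is the point here.

```python
import time
from typing import Iterator

def stream_tokens(prompt: str) -> Iterator[str]:
    # Stand-in for a streaming LLM API; yields tokens one at a time.
    yield from prompt.split()

def time_to_first_token(prompt: str) -> float:
    """Seconds from request start until the first token arrives."""
    start = time.perf_counter()
    next(stream_tokens(prompt))  # blocks until the first token
    return time.perf_counter() - start

latency = time_to_first_token("hello there")
```

The same pattern works end-to-end: start the timer at VAD end-of-speech instead of at the LLM request, and the measurement covers Whisper plus the LLM together.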
Sentence-Level Formation Complexity
The complexity of sentence formation at the LLM level became clearer this week. The pipeline needs to decide when to emit a sentence: too early and you interrupt the thought, too late and the user feels lag. We're working with:
- Streaming token output rather than waiting for full responses
- Using punctuation and semantic markers to identify sentence boundaries
- Balancing between rapid feedback and coherent output
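A minimal sketch of the punctuation-based boundary detection: accumulate streamed tokens and flush a sentence as soon as a terminal punctuation mark appears, so downstream synthesis can start speaking without waiting for the full response. The function and its token format are illustrative, assuming whitespace-delimited tokens.

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")

def sentences_from_stream(tokens: Iterable[str]) -> Iterator[str]:
    """Group a token stream into sentences on terminal punctuation."""
    buf: list[str] = []
    for tok in tokens:
        buf.append(tok)
        if tok.endswith(SENTENCE_END):
            yield " ".join(buf)
            buf = []
    if buf:  # flush any trailing partial sentence at end of stream
        yield " ".join(buf)

chunks = list(sentences_from_stream(["Hi", "there.", "How", "are", "you?"]))
# chunks == ["Hi there.", "How are you?"]
```

Punctuation alone is a crude boundary signal (abbreviations, numbers), which is why we're also looking at semantic markers to decide when a sentence is genuinely complete.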
The dual-GPU setup helps here too. While Whisper is processing the next chunk of audio, the LLM can continue streaming its response. There's temporal overlap in processing that we couldn't achieve before.
What's Next
Real-time end-to-end communication is within reach. The next focus is optimizing the sentence formation logic further and ensuring the VAD (Voice Activity Detection) integration works seamlessly with the new GPU pipeline.