Month 5 Week 3: GPU Migration and Real-Time Performance Optimization
November 16, 2025
The core challenge this week was performance at scale. We had Whisper running on the CPU-based Kubernetes cluster, and it was becoming a bottleneck. Real-time conversation requires every component to be fast, from capturing audio to generating the first token. This week was about moving pieces to GPU and making the architectural changes needed to hit real-time latency targets.
The Whisper CPU Problem
Whisper running on CPU was viable for prototyping, but not for production real-time communication. The latency was noticeable—users would speak, and there'd be a delay waiting for transcription. When you're trying to maintain natural conversation flow, every millisecond matters. The CPU cluster was saturating under even moderate loads.
Migrating to GPU Kubernetes Cluster
Moving Whisper to the GPU cluster was the first step. This alone gave us significant speedups: Whisper is optimized for GPU execution, and the difference was immediate. Transcription that took seconds on CPU now takes a fraction of a second on GPU. But this wasn't a plug-and-play migration. It involved:
- Reworking the pipeline to route audio through the GPU cluster
- Managing resource allocation so Whisper had the compute it needed
- Handling the orchestration between the Stasis application and the GPU cluster
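The routing change can be sketched as a small dispatch function. The service URLs and the fallback logic below are illustrative, not our actual deployment names: the point is that audio now targets a dedicated GPU Whisper service, with the old CPU deployment kept as a fallback.

```python
# Hypothetical in-cluster service DNS names; the real names differ.
WHISPER_GPU_URL = "http://whisper-gpu.transcribe.svc.cluster.local:9000"
WHISPER_CPU_URL = "http://whisper-cpu.transcribe.svc.cluster.local:9000"

def pick_transcription_endpoint(gpu_available: bool) -> str:
    """Route audio chunks to the GPU deployment, falling back to CPU
    if the GPU service is unavailable."""
    return WHISPER_GPU_URL if gpu_available else WHISPER_CPU_URL
```

In practice the availability check comes from the orchestration layer (e.g. readiness probes), so the Stasis application never has to know which cluster actually served the request.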
Dual GPU Architecture for Parallel Processing
One GPU running both Whisper and LLM inference created contention. Both models compete for memory and compute, and you hit saturation quickly. The solution was adding a second GPU and splitting the workload:
- GPU 1: Whisper (speech-to-text) - isolated and optimized for audio processing
- GPU 2: LLM inference - handling the conversational response generation
This architectural split was crucial. Instead of one bottleneck, we now have parallel processing. Audio comes in, Whisper transcribes it on GPU 1, the LLM generates a response on GPU 2, and we feed it back to the user. No waiting for one model to finish before the other starts.
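The split can be illustrated as a producer/consumer pipeline with one worker per GPU. The `transcribe()` and `generate()` stubs below stand in for Whisper pinned to the first GPU and the LLM pinned to the second; everything else is a minimal sketch of the queue wiring, not our production code.

```python
import queue
import threading

audio_q: "queue.Queue" = queue.Queue()   # raw audio chunks in
text_q: "queue.Queue" = queue.Queue()    # transcripts between the GPUs
reply_q: "queue.Queue" = queue.Queue()   # generated responses out

def transcribe(chunk: str) -> str:
    # Stand-in for Whisper running on GPU 1 (e.g. device "cuda:0").
    return f"transcript({chunk})"

def generate(text: str) -> str:
    # Stand-in for LLM inference on GPU 2 (e.g. device "cuda:1").
    return f"reply({text})"

def whisper_worker() -> None:
    # Drains audio chunks until a None sentinel, then passes it on.
    while (chunk := audio_q.get()) is not None:
        text_q.put(transcribe(chunk))
    text_q.put(None)

def llm_worker() -> None:
    while (text := text_q.get()) is not None:
        reply_q.put(generate(text))

threads = [threading.Thread(target=whisper_worker),
           threading.Thread(target=llm_worker)]
for t in threads:
    t.start()
audio_q.put("chunk-1")
audio_q.put(None)          # signal end of audio
for t in threads:
    t.join()
result = reply_q.get()     # "reply(transcript(chunk-1))"
```

Because each worker owns its own GPU, Whisper can start on the next audio chunk while the LLM is still generating a response to the previous one.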
Time-to-First-Response: The Real Metric
What matters in real-time communication is time-to-first-response—how fast do we produce the first token of output after the user finishes speaking? This is what feels natural. With the dual-GPU setup:
- Whisper completes transcription in under 500ms for typical audio chunks
- LLM starts generating the first response token within 100-200ms of receiving the transcription
- The full pipeline from audio capture to voice output is approaching real-time viability
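Measuring this metric is straightforward: start the clock when the user stops speaking and stop it at the first streamed token. The `stream_tokens()` generator below is a stand-in for the real streaming LLM call; only the timing pattern is the point here.

```python
import time
from typing import Iterator

def stream_tokens(prompt: str) -> Iterator[str]:
    # Stand-in for a streaming LLM API; yields tokens one at a time.
    yield from prompt.split()

def time_to_first_token(prompt: str) -> float:
    """Seconds from request start until the first token arrives."""
    start = time.perf_counter()
    next(stream_tokens(prompt))  # blocks until the first token
    return time.perf_counter() - start

latency = time_to_first_token("hello there")
```

The same pattern works end-to-end: start the timer at VAD end-of-speech instead of at the LLM request, and the measurement covers Whisper plus the LLM together.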
Sentence-Level Formation Complexity
The complexity of sentence formation at the LLM level became clearer this week. The pipeline needs to decide when to emit a sentence: too early and you interrupt the thought, too late and the user feels lag. We're working with:
- Streaming token output rather than waiting for full responses
- Using punctuation and semantic markers to identify sentence boundaries
- Balancing between rapid feedback and coherent output
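A minimal sketch of the punctuation-based boundary detection: accumulate streamed tokens and flush a sentence as soon as a terminal punctuation mark appears, so downstream synthesis can start speaking without waiting for the full response. The function and its token format are illustrative, assuming whitespace-delimited tokens.

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")

def sentences_from_stream(tokens: Iterable[str]) -> Iterator[str]:
    """Group a token stream into sentences on terminal punctuation."""
    buf: list[str] = []
    for tok in tokens:
        buf.append(tok)
        if tok.endswith(SENTENCE_END):
            yield " ".join(buf)
            buf = []
    if buf:  # flush any trailing partial sentence at end of stream
        yield " ".join(buf)

chunks = list(sentences_from_stream(["Hi", "there.", "How", "are", "you?"]))
# chunks == ["Hi there.", "How are you?"]
```

Punctuation alone is a crude boundary signal (abbreviations, numbers), which is why we're also looking at semantic markers to decide when a sentence is genuinely complete.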
The dual-GPU setup helps here too. While Whisper is processing the next chunk of audio, the LLM can continue streaming its response. There's temporal overlap in processing that we couldn't achieve before.
What's Next
Real-time end-to-end communication is within reach. The next focus is optimizing the sentence formation logic further and ensuring the VAD (Voice Activity Detection) integration works seamlessly with the new GPU pipeline.