By Rakesh Unni, Global Product Leader – Growth & Strategy, Pure IP
We need to have an honest conversation about the Lego approach to voice AI.
On paper, it looks like the perfect modern architecture. You pick a best-in-class CPaaS provider for ingress. You route that SIP stream to a specialized voice AI platform. That platform might then call out to a best-of-breed STT engine, a separate LLM provider, and a high-fidelity TTS service. It's modular, it's flexible, and it avoids vendor lock-in.
But after fifteen years of debugging carrier networks and building real-time voice solutions, I have watched this exact architecture become the single biggest bottleneck for enterprises trying to scale. That flexible stack you built is not just adding latency. It is introducing a hidden tax of complexity, fragility, and operational drag that makes truly natural conversation impossible.
Let's talk about why a vertically integrated provider, one that owns both the PSTN termination and the AI layer, ideally leveraging unified speech-to-speech models, is the only way to get to production-grade voice AI.
Everyone obsesses over model inference speed. They focus on making the LLM faster. But in a real-world voice call, the model is only one actor in a long and winding journey. Let me walk you through what actually happens in a typical stitched-together production call flow.
A customer's call hits a carrier and lands on a Session Border Controller (SBC) inside your CPaaS provider's data center, say in US-East. That CPaaS provider does not do AI, so they hairpin the call via SIP over the public internet to your voice AI provider's SBC, which might be in US-West.
Now the voice AI provider receives the audio. They send it to a dedicated STT vendor via an API call. The resulting text goes to an LLM via another API call. The response goes to a TTS vendor via a third API call. Finally, the synthesized audio is sent back through the voice AI provider, back across the public internet to the CPaaS provider's media server, and out to the customer.
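To make that serialization concrete, here is a minimal sketch of one conversational turn through a stitched stack. The vendor clients are placeholder stubs, not any real SDK; the point is that each hop is a separate awaited call that cannot start until the previous one has finished.

```python
import asyncio
import time

# Placeholder vendor clients -- illustrative stubs, not any real SDK.
# The sleeps stand in for per-hop network transit plus vendor processing.
class StubVendor:
    def __init__(self, name: str, delay_s: float):
        self.name, self.delay_s = name, delay_s

    async def call(self, payload):
        await asyncio.sleep(self.delay_s)
        return f"{self.name}-output"

stt = StubVendor("stt", 0.25)
llm = StubVendor("llm", 0.40)
tts = StubVendor("tts", 0.20)

async def handle_turn(audio_chunk: bytes) -> str:
    """One conversational turn through a stitched stack.

    Each await is a separate vendor, a separate network hop, and a separate
    failure domain -- and none of them can start until the previous one finishes.
    """
    transcript = await stt.call(audio_chunk)   # hop 1: STT vendor API
    reply_text = await llm.call(transcript)    # hop 2: LLM provider API
    reply_audio = await tts.call(reply_text)   # hop 3: TTS vendor API
    return reply_audio                         # then back out via the CPaaS media leg

start = time.perf_counter()
asyncio.run(handle_turn(b"\x00" * 320))
print(f"model hops alone: {time.perf_counter() - start:.2f}s")  # before any SIP/media transit
```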
If you are lucky, and all these services are well optimized and sitting in the same cloud region, you might hit 600 to 800 milliseconds of end-to-end latency. But a typical, well-run stitched stack lands between 800 milliseconds and one and a half seconds when everything is behaving. Humans detect lag at about 200 milliseconds. At more than one second, you are not building a conversation; you are building a walkie-talkie experience.
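Here is roughly where that budget goes. The per-hop numbers below are illustrative assumptions, not measurements of any particular vendor, but they show how quickly the milliseconds stack up.

```python
# Illustrative latency budget for the stitched flow above. The per-hop numbers
# are assumptions, not measurements of any specific vendor.
budget_ms = {
    "carrier to CPaaS SBC (US-East)":        30,
    "SIP hairpin over public internet":      60,
    "voice AI platform ingest + buffering":  80,
    "STT API call":                         250,
    "LLM time-to-first-token":              400,
    "TTS API call":                         200,
    "return leg to CPaaS media server":      60,
    "egress back to the caller":             30,
}

total = sum(budget_ms.values())
print(f"end-to-end: {total} ms")                                # ~1,100 ms on this sketch
print(f"over the ~200 ms perception threshold by {total - 200} ms")
```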
The technical complexity is only half the problem. Multi-vendor stacks create operational nightmares that compound over time. When call quality degrades or AI response times spike, where do you even start? You're correlating logs across four different dashboards, each with different timestamp formats, different call identifiers, and different definitions of what constitutes a "session."
The carrier blames network conditions and points to high latency from your AI vendor. The AI platform shows clean metrics on its end but suspects the SBC configuration. The STT vendor's logs show they received audio late. You spend hours reconstructing a single call flow from fragmented telemetry, only to discover the root cause was a routing change two providers upstream that nobody bothered to communicate.
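To see why that reconstruction takes hours, consider what it takes just to put three vendors' records for the same call on one timeline. The field names and formats below are invented for illustration, but the mismatch pattern will look familiar.

```python
from datetime import datetime, timezone

# Hypothetical per-vendor log rows -- field names and formats are invented to
# illustrate the mismatch, not taken from any real dashboard export.
cpaas_row = {"call_sid": "CA123",   "ts": "2026-01-15T14:02:11.532Z",     "event": "media_start"}
ai_row    = {"session":  "sess-9f2", "timestamp_ms": 1768485733021,       "event": "stt_request"}
stt_row   = {"req_id":   "r-77",     "time": "15/01/2026 14:02:13.344",   "event": "audio_received"}

def normalize(row: dict, id_key: str, ts_key: str, fmt: str | None) -> dict:
    """Coerce one vendor's row into a shared (utc_time, call_ref, event) shape."""
    raw = row[ts_key]
    if fmt:                      # string timestamp with a vendor-specific format
        ts = datetime.strptime(raw, fmt).replace(tzinfo=timezone.utc)
    else:                        # epoch milliseconds
        ts = datetime.fromtimestamp(raw / 1000, tz=timezone.utc)
    return {"utc_time": ts, "call_ref": row[id_key], "event": row["event"]}

timeline = sorted(
    [
        normalize(cpaas_row, "call_sid", "ts",           "%Y-%m-%dT%H:%M:%S.%fZ"),
        normalize(ai_row,    "session",  "timestamp_ms", None),
        normalize(stt_row,   "req_id",   "time",         "%d/%m/%Y %H:%M:%S.%f"),
    ],
    key=lambda e: e["utc_time"],
)

for event in timeline:
    # Even after normalizing timestamps, there is no shared identifier to join
    # on -- only time proximity.
    print(event["utc_time"].isoformat(), event["call_ref"], event["event"])
```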
Compliance becomes exponentially harder. You need to maintain call recording policies across multiple jurisdictions, but your PSTN provider stores metadata in one region, your AI platform processes audio in another, and your STT vendor has yet another data residency model.
Contrast this with a vertically integrated provider that owns both the carrier-grade telephony network and the voice AI platform. The advantage here isn't just about software. It's about network sovereignty. When a single vendor controls the PSTN interconnects, the private backbone, and the AI inference endpoints, the physics of latency change fundamentally.
In this model, traffic stays "on-net" for the entire lifecycle of the conversation. The provider doesn't hand off media streams to a third party to reach an STT service; they route it securely across their own managed infrastructure, where they control the quality-of-service tagging and routing logic. You eliminate the unpredictable variance of public internet hops and the hairpinning of media between disparate providers.
Platform teams can tune the entire path as a single system. Because they own the SIP stack and the media gateways, they can implement aggressive optimizations like starting AI inference the moment audio packets arrive at the network edge, rather than waiting for a third-party API handshake. Observability also transforms from archaeological log reconstruction to real-time instrumentation. One trace ID follows a call from PSTN ingress through transcription, LLM inference, and synthesis. When latency spikes, you see exactly which component degraded and why, because the data isn't siloed in another vendor's black box.
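In practice, the difference comes down to minting one trace ID at PSTN ingress and carrying it through every stage. Here is a minimal, self-contained sketch of the idea; the stage names and timings are illustrative, not any platform's real instrumentation.

```python
import time
import uuid
from contextlib import contextmanager

# A minimal sketch of single-trace instrumentation, assuming you control every
# stage (or can propagate the trace ID across your own services).
spans: list[dict] = []

@contextmanager
def span(trace_id: str, stage: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append({"trace_id": trace_id, "stage": stage,
                      "duration_ms": round((time.perf_counter() - start) * 1000, 1)})

def handle_call(audio: bytes) -> None:
    trace_id = str(uuid.uuid4())            # minted once, at PSTN ingress
    with span(trace_id, "pstn_ingress"):
        time.sleep(0.01)                    # stand-in for SBC/media handling
    with span(trace_id, "transcription"):
        time.sleep(0.02)
    with span(trace_id, "llm_inference"):
        time.sleep(0.04)
    with span(trace_id, "synthesis"):
        time.sleep(0.02)

handle_call(b"")
for s in spans:
    print(s)   # every stage carries the same trace_id -- one query shows the whole path
```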
The tight integration of network and compute is what unlocks the next evolution in this space: unified speech-to-speech models. These models bypass the traditional STT to LLM to TTS pipeline entirely. Instead of transcribing audio to text, processing it through a language model, and synthesizing new audio, these models process audio directly in the acoustic domain and generate spoken responses without intermediate text representations.
The latency benefits are dramatic. You eliminate two entire model inference steps and the serialization overhead between them. More importantly, you preserve prosody, emotion, and conversational dynamics that get lost in text-based intermediation. When a caller is frustrated, a speech-to-speech model can detect that in their voice directly and modulate its response accordingly.
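Conceptually, the serving loop collapses to audio frames in, audio frames out. The sketch below uses a hypothetical stubbed model interface, since real speech-to-speech APIs differ, but it shows the shape of the integration: one streaming call per conversation, and no text anywhere in the middle.

```python
import asyncio

# A hypothetical audio-in / audio-out interface -- not a real model API. The
# point is the shape of the loop: frames go in, frames come out, and no text
# ever exists between them.
class StubSpeechToSpeechModel:
    async def stream(self, frames_in):
        async for frame in frames_in:
            yield frame[::-1]               # placeholder "response" audio

async def mic_frames(n: int = 5, frame_ms: int = 20):
    for i in range(n):
        await asyncio.sleep(frame_ms / 1000)
        yield bytes([i]) * 320              # 20 ms of 8 kHz, 16-bit mono audio

async def main():
    model = StubSpeechToSpeechModel()
    async for out_frame in model.stream(mic_frames()):
        # In the stitched pipeline, this point is only reached after STT, LLM,
        # and TTS have each completed a full round trip. Here the model can
        # start responding (and be interrupted) frame by frame.
        print(f"playback frame: {len(out_frame)} bytes")

asyncio.run(main())
```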
However, speech-to-speech models demand even tighter latency tolerances than traditional pipelines because they are designed for truly real-time, interruptible conversations. They effectively require the network and the model to move in lockstep. If you're hairpinning calls between a third-party PSTN provider and a separate AI platform over the public internet, the jitter alone will destroy the "real-time" illusion before the model even processes the first frame of audio.
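A rough way to feel the impact: even modest public-internet jitter forces a playout buffer that consumes a large slice of the perception budget before any inference happens. The inter-arrival numbers below are invented and the buffer sizing is a crude rule of thumb, but the arithmetic is the point.

```python
import statistics

# Illustrative inter-arrival times (ms) for 20 ms audio frames crossing the
# public internet between two providers -- invented numbers, not measurements.
arrivals_ms = [20, 21, 19, 45, 20, 22, 63, 20, 19, 38, 21, 20]

frame_ms = 20
jitter = statistics.pstdev(arrivals_ms)          # spread around the 20 ms cadence
worst_gap = max(arrivals_ms) - frame_ms          # late frames you must absorb
buffer_ms = max(worst_gap, 2 * jitter)           # crude playout-buffer rule of thumb

print(f"jitter (stddev): {jitter:.1f} ms")
print(f"playout buffer needed: ~{buffer_ms:.0f} ms")
print(f"that buffer alone is {buffer_ms / 200:.0%} of a 200 ms perception budget")
```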
This isn't to say multi-vendor architectures never make sense. If you're experimenting with bleeding-edge language models that only one provider offers, or you have regulatory requirements that mandate specific regional infrastructure, a modular approach may be your only option. Research teams building novel voice AI applications often benefit from swapping components rapidly to test hypotheses.
But here's the uncomfortable truth: most production voice AI deployments don't need that flexibility. They need predictable latency, carrier-grade reliability, and operational simplicity at scale. When you're handling thousands of concurrent calls for revenue-impacting use cases, the "best-of-breed" component you swap in for a 2% accuracy gain isn't worth the 200-millisecond latency penalty and the three-vendor finger-pointing sessions it introduces.
The voice AI market has matured rapidly. What required stitching together four different vendors in 2023 can now be delivered by single providers who've built vertically integrated stacks specifically for production telephony workloads. These platforms understand carrier-grade requirements because they are voice carriers. They offer unified SLAs because they control every component in the path, from the port on the PSTN switch to the final byte of generated audio.
If you're evaluating voice AI infrastructure in 2026, ask yourself: is your multi-vendor architecture delivering actual flexibility, or just operational complexity? Are you genuinely leveraging best-of-breed advantages, or are you spending engineering cycles playing integration whack-a-mole?
The illusion of flexibility is expensive. It costs you latency budget you can't afford, operational overhead that scales linearly with call volume, and the kind of systemic reliability issues that only surface under load. For production voice AI, the future isn't modular. It's integrated, owned, and optimized end-to-end.