A voice AI agent usually fails in production on the call itself, not on the model. Failures cluster in the pipeline around the model: latency, degraded audio, numbers that fail to connect in-country, and dropped calls.
Voice agents are getting cheaper and more capable by the month. The reasoning model has become the commodity layer of the stack.
Microsoft made that concrete. On May 1, 2026, it launched Microsoft 365 E7, the Frontier Suite, at $99 per user per month. The model sits inside it as one line item among many. Across software, the share of companies using seat-based pricing fell from 21% to 15% in twelve months, while the share using consumption and hybrid models rose from 27% to 41%. The model layer is converging on price and capability.
If that is the part getting cheaper and more interchangeable, it is not where a voice agent earns its keep.
Does a voice agent fail on the model or the phone line?
On the phone line, most of the time. AssemblyAI's 2026 Voice Agent Report found 82.5% of builders feel confident building voice agents, while 75% hit reliability barriers in production. The gap is not intelligence. It is what happens on a live call.
A voice agent is a pipeline: speech-to-text, then the model, then text-to-speech, with telephony carrying the audio in and out. Each stage adds latency and a way to fail. Production audio arrives through compressed telephony codecs, from cars and call centers and cheap headsets. A small drop in recognition accuracy cascades downstream. The consensus among people who run these systems is blunt: the fix is not a faster model.
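To see why a faster model alone does not fix the turn, it helps to add up the stages. The sketch below is illustrative only: the stage names, the per-stage latencies, the 800 ms budget, and the split of telephony into an inbound and an outbound leg are all assumptions made for the example, not measurements from any real deployment.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Stage:
    name: str
    latency_ms: float  # hypothetical per-turn latency, not a measured value


def total_latency_ms(stages):
    """Sum the latency every stage adds to a single conversational turn."""
    return sum(stage.latency_ms for stage in stages)


def speed_up(stages, name, factor):
    """Return a copy of the pipeline with one named stage made `factor` times faster."""
    return [
        replace(stage, latency_ms=stage.latency_ms / factor)
        if stage.name == name
        else stage
        for stage in stages
    ]


# Placeholder numbers chosen only to illustrate the arithmetic.
PIPELINE = [
    Stage("telephony_inbound", 150),
    Stage("speech_to_text", 200),
    Stage("model", 350),
    Stage("text_to_speech", 180),
    Stage("telephony_outbound", 150),
]
BUDGET_MS = 800  # assumed target for one turn

if __name__ == "__main__":
    baseline = total_latency_ms(PIPELINE)
    faster_model = total_latency_ms(speed_up(PIPELINE, "model", 2))
    print(f"baseline: {baseline} ms, budget: {BUDGET_MS} ms")
    print(f"model twice as fast: {faster_model} ms")
```

With these placeholder numbers the turn takes 1030 ms against an 800 ms budget, and halving the model's latency still leaves 855 ms. The stages around the model account for the rest.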
What part of a voice agent do the AI vendors not own?
The telephony layer. Voice AI platforms own the model, the speech-to-text, and the text-to-speech. The telephony underneath, the SIP trunk and the carrier connections and the numbers, gets bundled in by default and delegated to an upstream provider. The enterprise inherits whatever carrier the platform happens to route through.
For a single market that can be fine. For a multinational it is the whole problem. Numbers have to connect in-country, and calls have to meet rules that shift by jurisdiction. The audio has to hold up across regions, on a recorded line. None of that is the model's job, and none of it is what the AI vendor was built to do.
Why does the call layer decide the ROI?
Because the savings only land on calls that hold. Gartner projects conversational AI will cut roughly $80 billion from contact-center costs by the end of 2026. That return depends on callers staying on the line. A dropped call or a half-second of dead air ends the conversation before the agent earns anything. The return lives in the quality of the call, not the output of the model.
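The dependency on calls that hold can be written as simple arithmetic. This is a rough sketch under stated assumptions: the function name, the call volume, the saving per completed call, and the completion rates are hypothetical values chosen for illustration, not figures from Gartner or any other source.

```python
def realized_savings(calls, completion_rate, saving_per_completed_call):
    """Savings accrue only on calls that stay connected to the end.

    All inputs are hypothetical; completion_rate is the fraction of calls
    that do not drop or get abandoned.
    """
    if not 0.0 <= completion_rate <= 1.0:
        raise ValueError("completion_rate must be between 0 and 1")
    return calls * completion_rate * saving_per_completed_call


if __name__ == "__main__":
    calls = 10_000          # assumed monthly call volume
    saving_per_call = 4.0   # assumed saving per completed call

    for rate in (0.9, 0.7):
        savings = realized_savings(calls, rate, saving_per_call)
        print(f"completion rate {rate:.0%}: savings of about {savings:,.0f}")
```

In this toy example the model and the cost per call are identical in both scenarios. Only the completion rate changes, and the realized saving moves with it.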
The layer that has to work first
Voice agents are only as good as the call they sit on. A reasoning model means nothing if the audio drops or the numbers do not connect in-country.
Pure IP delivers the licensed-carrier voice layer that keeps agent calls connected and compliant across regions, the part of the stack the AI vendors do not own. It covers voice service in 137 countries, with PSTN replacement in 50+ of them.
Next step
Your voice agent works in a pilot. It stalls when you put it on real customer lines across countries. The gap is the carrier layer.
Talk to the team that runs PSTN replacement in 50+ countries about the voice infrastructure your agents connect through.