Sub-400ms response time. 20+ languages. Handles 1,000+ concurrent calls. We build production-grade voice AI agents on LiveKit, Vapi, and Retell. Engineers who've shipped this at scale.
One client cut their inbound call handling cost by 73% in the first 60 days. Another runs 1,400 concurrent outbound calls at peak with zero complaints about the bot.

The Problem We Solve
Key Results
<400ms
Response Time
End-to-end latency from speech to AI reply
1,000+
Concurrent Calls
Simultaneous live conversations on a single deployment
20+
Languages
Including all major English accents and regional variants
73%
Cost Reduction
Average inbound call handling cost drop within 60 days
Technical Capabilities
A lot of companies tried voice AI in 2023 or 2024, had a rough experience, and concluded the technology wasn't ready. They're wrong about the conclusion. They're right that most of what got shipped was garbage.
The symptoms are familiar. The bot pauses for two full seconds before responding. It talks over the user the moment they're finishing a sentence. The voice sounds fine for the first few words and then degrades into a metallic rasp. It fails on any deviation from the happy path and loops "I'm sorry, I didn't understand that" until the caller gives up.
Most bad voice AI fails at three specific points. The VAD (voice activity detection) tuning is wrong, so the bot cuts in too early or sits there waiting too long. The LLM prompt doesn't handle interruptions, so when a user says "wait, actually" mid-sentence the bot just keeps going. And the TTS pipeline isn't streaming, so the whole response buffers before playback starts, creating that dead half-second gap that makes the thing feel mechanical.
None of these are fundamental technology limitations. They're implementation failures. All of them are fixable, and fixing them is exactly what we do.
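To make the VAD point concrete, here is a minimal sketch of silence-based end-of-turn detection. It is an illustration, not our production turn detector: the function name, the 20ms frame size, and the thresholds are all assumptions, and real systems layer a trained VAD model and semantic turn detection on top of a timer like this.

```python
from typing import Optional, Sequence


def end_of_turn_frame(
    speech_flags: Sequence[bool], frame_ms: int, silence_ms: int
) -> Optional[int]:
    """Return the index of the frame where the turn is declared over.

    speech_flags holds one boolean per audio frame (True = speech detected).
    The turn ends once silence_ms of continuous silence follows some speech.
    Returns None if that never happens.
    """
    if frame_ms <= 0 or silence_ms <= 0:
        raise ValueError("frame_ms and silence_ms must be positive")
    frames_needed = -(-silence_ms // frame_ms)  # ceiling division
    heard_speech = False
    silent_run = 0
    for index, is_speech in enumerate(speech_flags):
        if is_speech:
            heard_speech = True
            silent_run = 0  # any speech resets the silence timer
        elif heard_speech:
            silent_run += 1
            if silent_run >= frames_needed:
                return index
    return None


# A caller who pauses for 100ms mid-sentence, then finishes and goes quiet.
# Frames are 20ms each (hypothetical): 10 speech, 5 silent, 10 speech, 30 silent.
flags = [True] * 10 + [False] * 5 + [True] * 10 + [False] * 30

too_eager = end_of_turn_frame(flags, frame_ms=20, silence_ms=80)
balanced = end_of_turn_frame(flags, frame_ms=20, silence_ms=400)
too_patient = end_of_turn_frame(flags, frame_ms=20, silence_ms=1000)
```

With an 80ms threshold the turn ends inside the mid-sentence pause, so the bot cuts in early. With 400ms it waits for the real end of the sentence. With 1,000ms it is still waiting when this sample runs out.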
We build two types of voice AI agents, and they're different enough that we treat them as separate services. Confusing them is how you end up with the wrong architecture.
Inbound is the more forgiving environment technically but the more demanding one for quality. When someone calls your business, they've already chosen to reach out. That moment is yours to lose. An inbound voice AI agent that fumbles the call costs you something real: that lead, that appointment, that renewal.
Our inbound AI voice agents handle appointment scheduling, lead qualification, FAQ resolution, call routing, order status, and after-hours coverage. We build them for the full range of caller behavior, not just the happy path. That means graceful fallbacks, warm handoffs to human agents when needed, sentiment analysis running throughout the call, and structured summaries pushed to your CRM the moment the call ends.
The key metric we optimize in inbound builds: containment rate. How many calls get fully resolved without transferring to a human? Clients typically come in at 20-30% containment with existing IVR systems. We get them to 65-80% within the first 30 days.
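For reference, containment rate is a simple ratio. The sketch below shows one way to compute it. The function name and the call counts are placeholders chosen to fall inside the ranges quoted above, not client data.

```python
def containment_rate(total_calls: int, resolved_without_transfer: int) -> float:
    """Fraction of calls fully resolved by the agent with no human transfer."""
    if total_calls <= 0:
        raise ValueError("total_calls must be positive")
    if not 0 <= resolved_without_transfer <= total_calls:
        raise ValueError("resolved_without_transfer must be between 0 and total_calls")
    return resolved_without_transfer / total_calls


# Placeholder volumes for illustration only.
ivr_baseline = containment_rate(total_calls=2000, resolved_without_transfer=500)
voice_agent = containment_rate(total_calls=2000, resolved_without_transfer=1400)
```

In this example the baseline works out to 25% and the voice agent to 70%. Calls the caller abandons count against containment here, which is a definitional choice worth agreeing on before comparing numbers.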
Outbound is where the economics get genuinely dramatic. An outbound AI calling campaign has a fundamentally different cost structure from a human SDR team or even a traditional auto-dialer.
Think about what 1,000 concurrent outbound calls actually means. No human team can do that. No human team can do it at 11pm on a Sunday. No human team can do it in 23 languages simultaneously. We build outbound voice AI agents that can.
We've built outbound systems for AI cold calling and lead qualification, appointment setting, payment reminders, post-purchase follow-up, renewal outreach, and survey calls. The AI scheduling agent we built for a real estate firm books 3x the appointments their inside sales team was booking, at a fraction of the cost.
A 1.2-second natural pause after "Hello?" before the agent speaks performs 34% better on connect-to-conversation rate than zero pause. Matching the caller's pace reduces hang-up rates by around 22%. These are the details that separate outbound AI calling that works from outbound AI calling that burns your reputation.
The standard architecture for a conversational voice AI agent has three stages, and every millisecond counts.
Stage 1: Speech-to-text. The caller's audio stream goes to Deepgram Nova-3. It's consistently the fastest with the lowest word error rate, especially on accented English and domain-specific vocabulary. Streaming transcription takes 80-120ms in production.
Stage 2: LLM response generation. We use GPT-4o for most conversational use cases, Claude 3.5 Sonnet when context and reasoning matter more, or Llama 3.3 70B on Groq when we need sub-200ms LLM latency. We stream tokens as they arrive. Time-to-first-token is typically 100-200ms.
Stage 3: Text-to-speech. The first chunk of LLM output goes immediately to ElevenLabs Turbo v2.5, which starts outputting audio in 60-100ms. The caller hears the agent start speaking while the LLM is still generating the rest of the sentence.
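The hand-off between stage 2 and stage 3 can be sketched as a small chunker that groups streamed LLM tokens into speakable phrases and releases each one as soon as it is complete. This is a simplified illustration with no vendor SDK calls: the function name, the boundary characters, and the minimum chunk length are assumptions.

```python
from typing import Iterable, Iterator

BOUNDARY_CHARS = ".!?,;:"  # assumed phrase boundaries


def chunk_for_tts(tokens: Iterable[str], min_chars: int = 8) -> Iterator[str]:
    """Yield speakable phrases from a token stream as soon as each is ready.

    A phrase is released when it ends on a boundary character and is at
    least min_chars long, so the TTS engine is not fed tiny fragments.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        text = buffer.strip()
        if text and len(text) >= min_chars and text[-1] in BOUNDARY_CHARS:
            yield text
            buffer = ""
    tail = buffer.strip()
    if tail:
        yield tail  # flush whatever is left when the stream ends


# Stand-in for a streamed LLM response (hypothetical tokens).
llm_tokens = ["Sure", ",", " I", " can", " help", ".", " What", " day", " works", "?"]
phrases = list(chunk_for_tts(llm_tokens))
```

Because the function is a generator, the first phrase is available to send to TTS before the remaining tokens have arrived, which is the property that keeps the caller from waiting on the full response.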
Done right, end-to-end latency from the caller stopping speaking to the agent starting to respond is under 400ms. We've hit 280ms on good network conditions. Median in production: 320-380ms. Done wrong, it's 900ms and the experience is ruined. The difference is streaming implementation and VAD tuning, not model choice.
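As a rough check, the per-stage figures quoted above can be added into a latency budget. The sketch below does only that arithmetic. It ignores network transit and end-of-turn detection delay, which are not quantified here, and the dictionary keys are our own labels.

```python
from typing import Dict, Tuple

# (low, high) milliseconds per stage, taken from the figures quoted above.
STAGE_LATENCY_MS: Dict[str, Tuple[int, int]] = {
    "speech_to_text": (80, 120),
    "llm_first_token": (100, 200),
    "tts_first_audio": (60, 100),
}


def latency_budget(stages: Dict[str, Tuple[int, int]]) -> Tuple[int, int]:
    """Sum the best-case and worst-case latency across all stages."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst


best_ms, worst_ms = latency_budget(STAGE_LATENCY_MS)
```

The stages alone sum to 240ms in the best case and 420ms in the worst case. That leaves very little headroom under a 400ms target, which is why every stage has to stream rather than wait for the previous one to finish.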
OpenAI's Realtime API (GPT-4o Realtime) collapses the three-stage pipeline into a single speech-to-speech model. Audio in, audio out, in roughly 300-350ms. The conversational naturalness is impressive. The model handles interruptions, emotional shifts, and prosody in real time.
The tradeoff is control. With the classic pipeline, you can swap components and use a custom cloned voice. With the Realtime API, you're limited to OpenAI's built-in voices. You can't swap in a custom cloned voice or route complex intents to a specialized model. For enterprise deployments with brand voice requirements, you usually want the classic pipeline. We've shipped both. The recommendation comes from the discovery session.
Vapi is our starting point for most mid-market builds. It handles SIP, WebRTC, and PSTN, and has pre-built CRM integrations. Honest take: it's the wrong choice if you're expecting 500,000+ minutes per month or need deep customization of conversation state.
LiveKit Agents gives you full infrastructure control: custom STT/LLM/TTS configurations, custom VAD and turn detection, full event streaming, and HIPAA-compliant data handling. More complex to build, but the right answer for healthcare, financial services, and anything that needs real compliance.
Retell AI has the best out-of-the-box turn detection we've tested. Turn detection is what determines whether your agent talks over the caller or waits appropriately. Getting it wrong is the fastest way to make voice AI feel robotic. Retell's approach is genuinely good, and the developer experience makes it excellent for fast-moving builds.
The honest answer: it depends on what you're willing to build, not on what the technology can do.
The baseline we ship uses ElevenLabs Turbo v2.5 with a custom cloned voice, trained on 30+ minutes of your existing audio. We've run blind tests. The vast majority of listeners don't flag it.
But the voice model isn't what makes AI sound robotic. It's the conversation logic. A beautiful synthetic voice saying "I'm sorry, I didn't understand that" for the third time sounds robotic. A slightly synthetic voice that catches a mid-sentence interruption gracefully, remembers what was said two turns ago, and gives a genuinely useful answer sounds human.
The factors that actually determine perceived naturalness:
Sentiment analysis runs throughout every call. If the caller's tone shifts toward frustration and crosses a threshold, the agent queues a warm handoff to a human. This is how you prevent AI from escalating situations a person would de-escalate.
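One simple way to express a frustration threshold is a rolling average over the last few turns, so that a single sharp remark does not trigger a transfer. The sketch below assumes sentiment scores between -1 (negative) and 1 (positive). The class name, window size, and threshold are hypothetical, not the values we ship.

```python
from collections import deque


class FrustrationMonitor:
    """Signal a warm handoff when average sentiment over recent turns drops too low."""

    def __init__(self, threshold: float = -0.4, window: int = 3) -> None:
        self.threshold = threshold
        self.scores = deque(maxlen=window)  # keeps only the most recent turns

    def update(self, score: float) -> bool:
        """Record one turn's sentiment; return True if a handoff should be queued."""
        self.scores.append(score)
        window_full = len(self.scores) == self.scores.maxlen
        average = sum(self.scores) / len(self.scores)
        return window_full and average <= self.threshold


monitor = FrustrationMonitor()
# Hypothetical per-turn scores for a call that turns sour.
decisions = [monitor.update(s) for s in [0.2, -0.1, -0.5, -0.7, -0.8]]
```

In this example the monitor stays quiet for the first three turns and queues a handoff on the fourth, once the three-turn average falls below the threshold.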
We don't pad these numbers. Here's the realistic range from what we've shipped.
Healthcare is where we've seen the highest ROI on inbound voice AI, and also the highest stakes. A patient who can't get through to schedule doesn't always call back. They go somewhere else.
We build HIPAA-aware inbound agents on LiveKit with end-to-end encryption. They handle new patient intake, appointment scheduling, insurance pre-verification scripting, prescription refill routing, and after-hours triage. The agent can pull from your EHR to answer formulaic clinical questions without a staff member on the call.
Speed-to-lead is everything in real estate, and the gap between when a lead submits a form and when a human calls back is where deals die. An AI sales caller that responds within 2 minutes of a web form submission, qualifies the lead across 8-10 criteria, and books a showing into the agent's calendar is a genuine competitive edge.
Voice AI for finance requires compliance-aware conversation design. We've built outbound AI voice agents for insurance that handle renewal outreach, claims status calls, and payment reminders, with automatic flagging for regulatory-trigger phrases and escalation paths built into every call flow.
The missed-call recovery story. A plumbing, HVAC, or landscaping company whose owner is on a job site misses 30-40% of inbound calls. Each one is a job that might go to a competitor. An AI receptionist for small business that picks up every call, books the job, and sends a confirmation text solves this completely. Two-week setup. Runs 24/7. Costs less per month than two hours of a receptionist's time.
We've built AI SDR agents for SaaS companies targeting trial users who haven't converted. The agent qualifies interest, surfaces blockers, and books demos. It runs at a scale and on a schedule no human sales team can match.
Orchestration: Vapi, LiveKit Agents, Retell AI, Pipecat
Speech-to-text: Deepgram Nova-3 (default), AssemblyAI, OpenAI Whisper
LLM: GPT-4o, Claude 3.5 Sonnet, Llama 3.3 70B on Groq, fine-tuned open-source models
Text-to-speech: ElevenLabs Turbo v2.5 with custom AI voice cloning, Cartesia Sonic, Azure Neural
Telephony: Twilio, Telnyx
CRM and integrations: Salesforce, HubSpot, Close, Pipedrive, Google Calendar, Cal.com, n8n, Zapier, Slack
Two to four weeks from kickoff to production.
Week 1: PoC. Days 1-2: discovery, conversation flow mapping, integration spec. Days 3-5: working PoC. A real phone number. You call it. You break it. You give us feedback.
Week 2: Iteration. Edge case testing, VAD tuning, LLM prompt optimization, latency profiling. Usually 2-3 rounds based on your team's testing.
Weeks 3-4: Integration and Deployment. CRM integration, telephony setup, monitoring, production deployment with fallback routing, and full handoff documentation.
After launch: 30-day optimization window. We review call recordings, identify failure modes, and push patches. The first 30 days in production are where you find the edge cases testing misses.
We don't publish fixed prices because the range is genuinely wide. Here's enough to self-qualify.
All engagements start with a free PoC. We'd rather prove value in week one than convince you over three weeks.
How It Works
From discovery to production in weeks, not quarters
01. Pull 3–6 months of transcripts. Identify your top 10–15 call intents by volume; these become your automation targets.
02. In 30 minutes we map your intents, phone system, and CRM integrations to produce a fixed PoC scope and price.
03. Working agent on a test line using your real data. Test against real scenarios before any production deployment.
04. Connect to CRM, scheduling, or order management. Go live with escalation paths and monitoring dashboards in place.
Industry Applications
Healthcare
Patient intake, appointment scheduling, insurance pre-verification, prescription refill routing, and after-hours triage.
Handles intake for 500+ patients/day without staff
Financial Services
Account balance queries, fraud alert confirmation, loan status, and appointment booking with compliance-aware design.
Compliance-aware conversation design built in
E-commerce
Order status, returns initiation, product questions, and escalation to live agents for complex issues.
68% of calls resolved without human intervention
Property Management
Maintenance request intake, rent payment queries, lease renewal scheduling, and emergency routing.
24/7 tenant support with zero overnight staffing
Automotive
Service appointment booking, recall notifications, parts availability, and dealer routing for inbound calls.
Booking conversion rate up 34% vs. hold queue
Outbound Sales
High-volume outbound campaigns, lead qualification, appointment setting, and warm transfer to closers.
1,400 concurrent outbound calls at peak
Frequently Asked Questions
How long does it take to build a voice AI agent?
The PoC is done in 5 days. A production deployment with CRM integration, telephony setup, and QA takes 2-4 weeks depending on complexity. We've shipped straightforward inbound builds in 10 days. Complex multi-tenant outbound systems have taken 6 weeks. The scoping call on day one tells you which yours will be.
Will it actually sound human, or will my customers know they're talking to AI?
With ElevenLabs voice cloning and proper VAD tuning, most callers don't flag it in blind tests. What matters more than voice quality is response latency and conversation quality. An agent that responds in under 400ms, handles interruptions gracefully, and gives useful answers reads as human even if the voice isn't perfect. The voice is only part of the story.
What's the difference between Vapi, LiveKit, and Retell AI?
Vapi deploys fastest and has the most pre-built integrations. Retell has the best out-of-the-box turn detection. LiveKit gives you full infrastructure control and is the right call for HIPAA, compliance-heavy use cases, or anything that outgrows a SaaS model. We'll recommend the right one in the first consultation.
How do you handle compliance for healthcare or financial services?
Healthcare builds run on LiveKit with end-to-end encryption. We don't use Vapi or Retell for HIPAA-sensitive call data. For financial services, we build compliance-aware conversation design with automatic flagging for regulatory trigger phrases and escalation rules. We've built this. It's not theoretical.
What happens when the AI doesn't know the answer?
We build explicit fallback paths. The agent doesn't loop "I'm sorry, I didn't understand that." It does something useful: connects to a human, schedules a callback, captures the question for follow-up, or acknowledges the gap and moves on. Fallback design is part of the conversation architecture.
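A fallback policy along these lines can be written as an explicit decision function rather than left to the prompt. The sketch below is illustrative only: the input flags and the order of the rules are assumptions, and a real build derives them from the conversation design.

```python
from enum import Enum


class Fallback(Enum):
    TRANSFER_TO_HUMAN = "transfer_to_human"
    SCHEDULE_CALLBACK = "schedule_callback"
    CAPTURE_QUESTION = "capture_question"
    ACKNOWLEDGE_AND_MOVE_ON = "acknowledge_and_move_on"


def choose_fallback(
    question_is_blocking: bool, human_available: bool, wants_follow_up: bool
) -> Fallback:
    """Pick a useful next step when the agent cannot answer (hypothetical policy)."""
    if question_is_blocking:
        # The call cannot progress without an answer: get a person involved.
        if human_available:
            return Fallback.TRANSFER_TO_HUMAN
        return Fallback.SCHEDULE_CALLBACK
    if wants_follow_up:
        return Fallback.CAPTURE_QUESTION
    return Fallback.ACKNOWLEDGE_AND_MOVE_ON


after_hours = choose_fallback(
    question_is_blocking=True, human_available=False, wants_follow_up=False
)
```

Keeping the policy in code makes every unanswerable question end in one of four defined outcomes, and none of them is a repeated apology.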
Can you clone our existing brand voice?
Yes. We need 30+ minutes of clean audio from the speaker you want to clone. We use ElevenLabs Professional Voice Cloning. In blind listening tests, the result is indistinguishable from the original for the vast majority of listeners.
How does CRM sync and sentiment analysis work?
Sentiment analysis runs in real time throughout every call. At call end, the agent pushes a call summary, sentiment score, key facts captured, and full transcript to your CRM. Native integrations for Salesforce and HubSpot. Webhook flows via n8n or Zapier for everything else. If sentiment crosses a frustration threshold, the agent queues a warm transfer and the CRM record gets flagged.
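As an illustration of what gets pushed at call end, here is a sketch of a call record serialized for a webhook. The field names are hypothetical and do not follow the Salesforce or HubSpot schema. The actual mapping is defined per integration.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass
class CallRecord:
    """End-of-call data for the CRM (hypothetical field names)."""

    call_id: str
    summary: str
    sentiment_score: float  # assumed scale: -1 (negative) to 1 (positive)
    key_facts: Dict[str, str]
    transcript: List[str]
    flagged_for_handoff: bool = False


def to_webhook_body(record: CallRecord) -> str:
    """Serialize the record as a JSON string suitable for a webhook POST body."""
    return json.dumps(asdict(record), sort_keys=True)


record = CallRecord(
    call_id="demo-001",
    summary="Caller rescheduled an appointment to Thursday.",
    sentiment_score=0.3,
    key_facts={"intent": "reschedule", "new_day": "Thursday"},
    transcript=["Agent: How can I help?", "Caller: I need to move my appointment."],
)
body = to_webhook_body(record)
```

The same JSON body can be posted to an n8n or Zapier webhook, where the fields are mapped onto whatever the destination system expects.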
What languages do you support?
20+ languages out of the box: English (all major accents), Spanish, French, German, Portuguese, Italian, Dutch, Polish, Japanese, Korean, Mandarin, Hindi, Arabic, and more. Deepgram Nova-3 handles STT for most languages. ElevenLabs covers 30+ languages for TTS.
What does 1,000+ concurrent calls mean in practice?
It means 1,000 separate live conversations at the same time. For outbound, that means dialing a list of 10,000 contacts and working through it in hours, not days. We've run load tests at 1,400 concurrent calls on LiveKit infrastructure without quality degradation. Horizontal scaling is built into the architecture from day one.
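The effect of concurrency on campaign time is easy to estimate. The sketch below computes a lower bound for a single pass through a list. The three minutes per attempt is a placeholder, and the estimate ignores retries, calling-hour windows, and carrier pacing limits, all of which make real campaigns take longer.

```python
import math


def campaign_minutes(contacts: int, concurrency: int, minutes_per_attempt: float) -> float:
    """Lower-bound duration of one pass through a contact list.

    Calls are modeled as waves of `concurrency` simultaneous attempts,
    each wave lasting `minutes_per_attempt`.
    """
    if contacts < 0 or concurrency <= 0 or minutes_per_attempt <= 0:
        raise ValueError("contacts must be >= 0; other arguments must be positive")
    waves = math.ceil(contacts / concurrency)
    return waves * minutes_per_attempt


# 10,000 contacts, assuming 3 minutes per attempt (placeholder figure).
at_1000_lines = campaign_minutes(10_000, concurrency=1_000, minutes_per_attempt=3.0)
at_50_lines = campaign_minutes(10_000, concurrency=50, minutes_per_attempt=3.0)
```

Under these assumptions a single pass takes about 30 minutes at 1,000 concurrent calls and about 10 hours at 50.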
Can we run a PoC before committing to a full build?
Yes, this is standard. We build a working prototype in week one. You can call it, test it, and decide. If you don't move forward, you keep the prototype with no obligation. We offer this because we're confident in what we build.
Book a 30-minute call. We'll scope your use case, spec the right architecture, and have a working PoC on your desk inside a week. No slide decks. No proposals that go nowhere. Just a phone number you can call to talk to the thing we built.