Voice AI Agent Development

Voice AI Agent Development That Your Customers Won't Hang Up On

Sub-400ms response time. 20+ languages. Handles 1,000+ concurrent calls. We build production-grade voice AI agents on LiveKit, Vapi, and Retell. Engineers who've shipped this at scale.

One client cut their inbound call handling cost by 73% in the first 60 days. Another runs 1,400 concurrent outbound calls at peak with zero complaints about the bot.

Book a Free Consultation
Claim Your Free PoC

Abstract visualization of voice AI: sound waves transitioning into structured data and CRM records

The Problem We Solve

Before Hestur AI

Call centre handling 1,000+ calls/day with 40+ agents
Average handle time of 8 minutes per call
Long hold times and dropped calls frustrating customers
No coverage outside business hours
Cost per call of $12–$18 including agent overhead

After Hestur AI

Sub-400ms response time — sounds human
1,000+ concurrent calls on a single deployment
20+ languages with accent-matched voices
24/7 availability with zero hold time
73% reduction in inbound call handling cost

Key Results

<400ms

Response Time

End-to-end latency from speech to AI reply

1,000+

Concurrent Calls

Simultaneous live conversations on a single deployment

20+

Languages

Including all major English accents and regional variants

73%

Cost Reduction

Average inbound call handling cost drop within 60 days

Technical Capabilities

LiveKit, Vapi, and Retell AI orchestration
Deepgram Nova-3 speech-to-text
ElevenLabs voice cloning (30+ languages)
Real-time sentiment analysis on every call
CRM sync: Salesforce, HubSpot, and webhooks
HIPAA-compliant builds on LiveKit self-hosted
Interruption handling and turn detection
Full call transcription and tagging
Outbound dialling at 1,400+ concurrent calls
Sub-500ms end-to-end pipeline latency

What's Actually Wrong With Most Voice AI Right Now

A lot of companies tried voice AI in 2023 or 2024, had a rough experience, and concluded the technology wasn't ready. They're wrong about the conclusion. They're right that most of what got shipped was garbage.

The symptoms are familiar. The bot pauses for two full seconds before responding. It talks over the user the moment they're finishing a sentence. The voice sounds fine for the first few words and then degrades into a metallic rasp. It fails on any deviation from the happy path and loops "I'm sorry, I didn't understand that" until the caller gives up.

Most bad voice AI fails at three specific points. The VAD (voice activity detection) tuning is wrong, so the bot cuts in too early or sits there waiting too long. The LLM prompt doesn't handle interruptions, so when a user says "wait, actually" mid-sentence the bot just keeps going. And the TTS pipeline isn't streaming, so the whole response buffers before playback starts, creating that dead half-second gap that makes the thing feel mechanical.

None of these are fundamental technology limitations. They're implementation failures. All of them are fixable, and fixing them is exactly what we do.
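To make the VAD tuning problem concrete, here is a minimal sketch of silence-based end-of-turn detection. It assumes an upstream VAD already labels each audio frame as speech or not. The 20ms frame size, the threshold values, and the names Endpointer and end_of_turn_frame are illustrative assumptions, not settings from any specific platform.

```python
from dataclasses import dataclass


@dataclass
class Endpointer:
    """Decides when the caller has finished a turn, from per-frame speech flags."""
    frame_ms: int = 20          # assumed frame size
    min_silence_ms: int = 300   # hypothetical tuning value
    silence_ms: int = 0
    heard_speech: bool = False

    def push(self, is_speech: bool) -> bool:
        """Feed one frame; return True on the frame where the turn ends."""
        if is_speech:
            self.heard_speech = True
            self.silence_ms = 0
            return False
        if not self.heard_speech:
            return False  # leading silence before the caller says anything
        self.silence_ms += self.frame_ms
        if self.silence_ms >= self.min_silence_ms:
            self.heard_speech = False
            self.silence_ms = 0
            return True
        return False


def end_of_turn_frame(flags, min_silence_ms):
    """Return the index of the frame where end of turn fires, or None."""
    endpointer = Endpointer(min_silence_ms=min_silence_ms)
    for index, is_speech in enumerate(flags):
        if endpointer.push(is_speech):
            return index
    return None


# Speech, a 100ms mid-sentence pause, more speech, then 600ms of silence.
frames = [True] * 10 + [False] * 5 + [True] * 10 + [False] * 30
```

With an 80ms threshold the endpointer fires inside the mid-sentence pause, which is the bot cutting in too early. With a threshold longer than the trailing silence it never fires, which is the bot sitting there waiting.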

Two Distinct Products: Inbound and Outbound Voice AI

We build two types of voice AI agents, and they're different enough that we treat them as separate services. Confusing them is how you end up with the wrong architecture.

Inbound Voice AI: Your 24/7 AI Receptionist

Inbound is the more forgiving environment technically but the more demanding one for quality. When someone calls your business, they've already chosen to reach out. That moment is yours to lose. An inbound voice AI agent that fumbles the call costs you something real: that lead, that appointment, that renewal.

Our inbound AI voice agents handle appointment scheduling, lead qualification, FAQ resolution, call routing, order status, and after-hours coverage. We build them for the full range of caller behavior, not just the happy path. That means graceful fallbacks, warm handoffs to human agents when needed, sentiment analysis running throughout the call, and structured summaries pushed to your CRM the moment the call ends.

The key metric we optimize in inbound builds: containment rate. How many calls get fully resolved without transferring to a human? Clients typically come in at 20-30% containment with existing IVR systems. We get them to 65-80% within the first 30 days.
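As a worked example of the metric, containment rate is simply the share of calls resolved without a transfer. The call counts below are made-up illustration values, not client data.

```python
def containment_rate(total_calls: int, transferred_to_human: int) -> float:
    """Fraction of calls fully resolved without transferring to a human."""
    if total_calls <= 0:
        raise ValueError("total_calls must be positive")
    if not 0 <= transferred_to_human <= total_calls:
        raise ValueError("transferred_to_human must be between 0 and total_calls")
    return (total_calls - transferred_to_human) / total_calls


# Hypothetical week: 1,000 inbound calls, 300 handed to a human agent.
example_rate = containment_rate(1000, 300)
```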

Outbound Voice AI: Scale That's Impossible With Humans

Outbound is where the economics get genuinely dramatic. Running an outbound AI calling campaign is a fundamentally different cost structure from a human SDR team or even a traditional auto-dialer.

Think about what 1,000 concurrent outbound calls actually means. No human team can do that. No human team can do it at 11pm on a Sunday. No human team can do it in 23 languages simultaneously. We build outbound voice AI agents that can.

We've built outbound systems for AI cold calling and lead qualification, appointment setting, payment reminders, post-purchase follow-up, renewal outreach, and survey calls. The AI scheduling agent we built for a real estate firm books 3x the appointments their inside sales team was booking, at a fraction of the cost.

A 1.2-second natural pause after "Hello?" before the agent speaks performs 34% better on connect-to-conversation rate than zero pause. Matching the caller's pace reduces hang-up rates by around 22%. These are the details that separate outbound AI calling that works from outbound AI calling that burns your reputation.

How It Works Under the Hood

The Classic Pipeline: STT + LLM + TTS

The standard architecture for a conversational voice AI agent has three stages, and every millisecond counts.

Stage 1: Speech-to-text. The caller's audio stream goes to Deepgram Nova-3. It's consistently the fastest with the lowest word error rate, especially on accented English and domain-specific vocabulary. Streaming transcription takes 80-120ms in production.

Stage 2: LLM response generation. We use GPT-4o for most conversational use cases, Claude 3.5 Sonnet when context and reasoning matter more, or Llama 3.3 70B on Groq when we need sub-200ms LLM latency. We stream tokens as they arrive. Time-to-first-token is typically 100-200ms.

Stage 3: Text-to-speech. The first chunk of LLM output goes immediately to ElevenLabs Turbo v2.5, which starts outputting audio in 60-100ms. The caller hears the agent start speaking while the LLM is still generating the rest of the sentence.
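The hand-off between Stage 2 and Stage 3 can be sketched as a small chunker that forwards text to TTS at the first clause boundary instead of waiting for the full response. This is an illustration under assumptions: the boundary characters, the min_chars value, and the name chunk_for_tts are hypothetical, and no real TTS client is called here.

```python
from typing import Iterable, Iterator

# Punctuation treated as a safe place to cut a chunk (assumed set).
BOUNDARIES = (".", "!", "?", ",", ";", ":")


def chunk_for_tts(tokens: Iterable[str], min_chars: int = 12) -> Iterator[str]:
    """Group streamed LLM tokens into speakable chunks.

    A chunk is emitted as soon as the buffer ends on a boundary character
    and is at least min_chars long, so audio can start early.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        text = buffer.strip()
        if len(text) >= min_chars and text.endswith(BOUNDARIES):
            yield text
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush whatever is left when the stream ends


streamed = ["Sure", ",", " I", " can", " book", " that", ".",
            " What", " day", " works", "?"]
chunks = list(chunk_for_tts(streamed))
```

The min_chars guard keeps very short fragments such as "Sure," from being synthesized on their own, which tends to sound choppy.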

Done right, end-to-end latency from the caller stopping speaking to the agent starting to respond is under 400ms. We've hit 280ms under good network conditions. Median in production: 320-380ms. Done wrong, it's 900ms and the experience is ruined. The difference is streaming implementation and VAD tuning, not model choice.
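A quick way to sanity-check the budget is to add up the per-stage ranges quoted above. This naive sum is only a sketch: it assumes the stages run strictly one after another and leaves out network transit and end-of-turn detection delay, both of which vary by deployment.

```python
# Per-stage latency ranges in milliseconds, taken from the three stages above.
STAGES_MS = {
    "stt_streaming": (80, 120),
    "llm_first_token": (100, 200),
    "tts_first_audio": (60, 100),
}


def latency_budget(stages):
    """Return (best_case_ms, worst_case_ms) as a simple sum of stage ranges."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst


best_ms, worst_ms = latency_budget(STAGES_MS)
```

The worst case of the three stages alone already lands slightly above 400ms, which is why every stage has to stream rather than buffer.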

The Speech-to-Speech Alternative

OpenAI's Realtime API (GPT-4o Realtime) cuts the pipeline entirely. Audio in, audio out, in roughly 300-350ms. The conversational naturalness is impressive. The model handles interruptions, emotional shifts, and prosody in real time.

The tradeoff is control. With the classic pipeline, you can swap components and use a custom cloned voice. With the Realtime API, you're limited to OpenAI's built-in voices. You can't swap in a custom voice or route complex intents to a specialized model. For enterprise deployments with brand voice requirements, you usually want the classic pipeline. We've shipped both. The recommendation comes from the discovery session.

Orchestration Platforms

Vapi is our starting point for most mid-market builds. It handles SIP, WebRTC, and PSTN, and has pre-built CRM integrations. Honest take: wrong choice if you're expecting 500,000+ minutes per month or need deep customization of conversation state.

LiveKit Agents gives you full infrastructure control: custom STT/LLM/TTS configurations, custom VAD and turn detection, full event streaming, and HIPAA-compliant data handling. More complex to build, but the right answer for healthcare, financial services, and anything that needs real compliance.

Retell AI has the best out-of-the-box turn detection we've tested. Turn detection is what determines whether your agent talks over the caller or waits appropriately. Getting it wrong is the fastest way to make voice AI feel robotic. Retell's approach is genuinely good, and the developer experience makes it excellent for fast-moving builds.
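The trade-offs above can be condensed into a rough decision helper. Treat it as a simplification for illustration: the function name, its parameters, and the order of the checks are assumptions, and a real recommendation weighs more factors than these four.

```python
def recommend_platform(needs_hipaa: bool,
                       monthly_minutes: int,
                       needs_deep_customization: bool,
                       prioritize_turn_detection: bool) -> str:
    """Rough platform pick based on the criteria described above."""
    # Compliance, heavy customization, or very high volume point to
    # full infrastructure control.
    if needs_hipaa or needs_deep_customization or monthly_minutes >= 500_000:
        return "LiveKit Agents"
    # Fast-moving builds where out-of-the-box turn detection matters most.
    if prioritize_turn_detection:
        return "Retell AI"
    # Default starting point for mid-market builds.
    return "Vapi"
```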

The Question Everyone Asks: Will It Sound Like a Robot?

The honest answer: it depends on what you're willing to build, not on what the technology can do.

The baseline we ship uses ElevenLabs Turbo v2.5 with a custom cloned voice, trained on 30+ minutes of your existing audio. We've run blind tests. The vast majority of listeners don't flag it.

But the voice model isn't what makes AI sound robotic. It's the conversation logic. A beautiful synthetic voice saying "I'm sorry, I didn't understand that" for the third time sounds robotic. A slightly synthetic voice that catches a mid-sentence interruption gracefully, remembers what was said two turns ago, and gives a genuinely useful answer sounds human.

The factors that actually determine perceived naturalness:

Response latency. Sub-400ms is where human-feeling conversation happens. Over 600ms, the call starts feeling like a phone system.
Interrupt handling. Proper barge-in support is non-negotiable. Skip the VAD tuning and you'll get a bot that talks over the user every third sentence.
Context memory within the call. The agent should know what was said earlier in the conversation.
Prosody matching. Adjusting speaking pace to match the caller's energy makes a measurable difference.
Graceful unknown handling. Not "I didn't understand" but "Let me make sure I have this right."
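Barge-in handling from the list above can be sketched as a tiny state machine: while the agent is speaking, sustained caller speech cancels playback, and a brief noise does not. The 20ms frame size, the 120ms threshold, and the class name BargeInController are illustrative assumptions rather than values from a particular platform.

```python
from enum import Enum


class State(Enum):
    LISTENING = "listening"
    SPEAKING = "speaking"


class BargeInController:
    """Cancels agent playback once the caller has clearly started talking."""

    def __init__(self, frame_ms: int = 20, min_speech_ms: int = 120):
        self.frame_ms = frame_ms            # assumed frame size
        self.min_speech_ms = min_speech_ms  # hypothetical tuning value
        self.state = State.LISTENING
        self.speech_ms = 0

    def start_speaking(self) -> None:
        """Call when agent audio playback begins."""
        self.state = State.SPEAKING
        self.speech_ms = 0

    def on_caller_frame(self, is_speech: bool) -> bool:
        """Feed one caller frame; return True when playback should stop."""
        if self.state is not State.SPEAKING:
            return False
        # Count only consecutive speech so a cough or click resets the timer.
        self.speech_ms = self.speech_ms + self.frame_ms if is_speech else 0
        if self.speech_ms >= self.min_speech_ms:
            self.state = State.LISTENING
            self.speech_ms = 0
            return True
        return False


controller = BargeInController()
controller.start_speaking()
caller_frames = [True, False, True, True, True, True, True, True]
decisions = [controller.on_caller_frame(frame) for frame in caller_frames]
```

Setting the threshold too low makes the agent stop for background noise; setting it too high makes it talk over the caller.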

Sentiment analysis runs throughout every call. If the caller's tone shifts toward frustration and crosses a threshold, the agent queues a warm handoff to a human. This is how you prevent AI from escalating situations a person would de-escalate.
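A minimal sketch of the threshold logic, assuming each caller utterance already has a sentiment score between -1 (very negative) and 1 (very positive) from some upstream model. The window size, the threshold, and the name FrustrationMonitor are hypothetical.

```python
from collections import deque


class FrustrationMonitor:
    """Queues a warm handoff when recent sentiment stays below a threshold."""

    def __init__(self, threshold: float = -0.4, window: int = 3):
        self.threshold = threshold            # hypothetical frustration cutoff
        self.scores = deque(maxlen=window)    # most recent utterance scores
        self.handoff_queued = False

    def add(self, score: float) -> bool:
        """Record one utterance score; return True once a handoff is queued."""
        self.scores.append(score)
        window_full = len(self.scores) == self.scores.maxlen
        average = sum(self.scores) / len(self.scores)
        # Averaging over a window avoids escalating on one sharp remark.
        if window_full and average <= self.threshold:
            self.handoff_queued = True
        return self.handoff_queued


monitor = FrustrationMonitor()
handoff_states = [monitor.add(s) for s in [0.2, -0.1, -0.5, -0.8, -0.9]]
```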

What to Expect: Real Results

We don't pad these numbers. Here's the realistic range from what we've shipped.

Inbound performance

Containment rate: 65-80% (vs. 20-30% for traditional IVR)
Cost per call: $0.04-0.12 (vs. $4-15 for human agents)
After-hours lead capture: 2-3x improvement
Average handle time: 3-4 minutes (vs. 8-12 for human agents)

Outbound performance

Concurrent calls: 1,000+
Cost reduction vs. human SDR team: 60-73% (source of the $300K/year savings figure)
Appointment set rate on warm lists: 8-14% depending on vertical and script quality
Languages supported: 20+

Use Cases We've Built For

Healthcare: AI Receptionist for Medical and Dental Practices

Healthcare is where we've seen the highest ROI on inbound voice AI, and also the highest stakes. A patient who can't get through to schedule doesn't always call back. They go somewhere else.

We build HIPAA-aware inbound agents on LiveKit with end-to-end encryption. They handle new patient intake, appointment scheduling, insurance pre-verification scripting, prescription refill routing, and after-hours triage. The agent can pull from your EHR to answer routine clinical questions without a staff member on the call.

Real Estate: Lead Qualification at Scale

Speed-to-lead is everything in real estate, and the gap between when a lead submits a form and when a human calls back is where deals die. An AI sales caller that responds within 2 minutes of a web form submission, qualifies the lead across 8-10 criteria, and books a showing into the agent's calendar is a genuine competitive edge.

Financial Services and Insurance

Voice AI for finance requires compliance-aware conversation design. We've built outbound AI voice agents for insurance that handle renewal outreach, claims status calls, and payment reminders, with automatic flagging for regulatory-trigger phrases and escalation paths built into every call flow.
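Trigger-phrase flagging can be as simple as pattern matching on each transcribed utterance. The sketch below is illustrative only: the categories and phrases are placeholders, and a real list would come from your compliance team rather than from this example.

```python
import re
from typing import List

# Placeholder patterns for illustration, not a vetted compliance list.
TRIGGER_PATTERNS = {
    "dispute": re.compile(r"\b(dispute|chargeback)\b", re.IGNORECASE),
    "legal": re.compile(r"\b(lawyer|attorney|lawsuit)\b", re.IGNORECASE),
    "do_not_call": re.compile(r"\b(stop calling|do not call|remove me)\b",
                              re.IGNORECASE),
}


def flag_utterance(text: str) -> List[str]:
    """Return the names of every trigger category found in one utterance."""
    return [name for name, pattern in TRIGGER_PATTERNS.items()
            if pattern.search(text)]
```

Any non-empty result would feed the escalation path for that call flow, for example a transfer to a licensed agent.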

Home Services and SMBs

The missed-call recovery story. A plumbing, HVAC, or landscaping company whose owner is on a job site misses 30-40% of inbound calls. Each one is a job that might go to a competitor. An AI receptionist for small business that picks up every call, books the job, and sends a confirmation text solves this completely. Two-week setup. Runs 24/7. Costs less per month than two hours of a receptionist's time.

SaaS and B2B Outbound

We've built AI SDR agents for SaaS companies targeting trial users who haven't converted. The agent qualifies interest, surfaces blockers, and books demos. It runs at a scale and on a schedule no human sales team can match.

The Stack

Orchestration: Vapi, LiveKit Agents, Retell AI, Pipecat

Speech-to-text: Deepgram Nova-3 (default), AssemblyAI, OpenAI Whisper

LLM: GPT-4o, Claude 3.5 Sonnet, Llama 3.3 70B on Groq, fine-tuned open-source models

Text-to-speech: ElevenLabs Turbo v2.5 with custom AI voice cloning, Cartesia Sonic, Azure Neural

Telephony: Twilio, Telnyx

CRM and integrations: Salesforce, HubSpot, Close, Pipedrive, Google Calendar, Cal.com, n8n, Zapier, Slack

Process and Timeline

Two to four weeks from kickoff to production.

Week 1: PoC. Days 1-2: discovery, conversation flow mapping, integration spec. Days 3-5: working PoC. A real phone number. You call it. You break it. You give us feedback.

Week 2: Iteration. Edge case testing, VAD tuning, LLM prompt optimization, latency profiling. Usually 2-3 rounds based on your team's testing.

Weeks 3-4: Integration and Deployment. CRM integration, telephony setup, monitoring, production deployment with fallback routing, and full handoff documentation.

After launch: 30-day optimization window. We review call recordings, identify failure modes, and push patches. The first 30 days in production are where you find the edge cases testing misses.

What This Costs

We don't publish fixed prices because the range is genuinely wide. Here's enough to self-qualify.

Simple inbound AI receptionist (SMB): $8,000-15,000 build + $200-500/month in platform costs
Mid-market inbound agent (healthcare, real estate, SaaS): $20,000-45,000 + $500-2,000/month
Enterprise outbound calling system: $50,000-150,000+ depending on scale and compliance requirements

All engagements start with a free PoC. We'd rather prove value in week one than convince you over three weeks.

How It Works

From discovery to production in weeks, not quarters

Audit Call Mix

Pull 3–6 months of transcripts. Identify your top 10–15 call intents by volume — these become your automation targets.
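Once transcripts carry an intent label, ranking intents by volume takes a few lines. This sketch assumes the labeling has already been done, by hand or by a classifier. The intent names and counts are made up for illustration.

```python
from collections import Counter


def top_intents(labels, n=10):
    """Return (intent, call_count, share_of_calls) for the n most common intents."""
    counts = Counter(labels)
    total = sum(counts.values())
    return [(intent, count, round(count / total, 3))
            for intent, count in counts.most_common(n)]


# Hypothetical labels, one per historical call.
sample_labels = ["order_status"] * 5 + ["billing"] * 3 + ["reschedule"] * 2
ranked = top_intents(sample_labels, n=2)
```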

Scoping Call

In 30 minutes we map your intents, phone system, and CRM integrations to produce a fixed PoC scope and price.

Proof of Concept

Working agent on a test line using your real data. Test against real scenarios before any production deployment.

Integrate & Deploy

Connect to CRM, scheduling, or order management. Go live with escalation paths and monitoring dashboards in place.

Industry Applications

Healthcare

Patient intake, appointment scheduling, insurance pre-verification, prescription refill routing, and after-hours triage.

Handles intake for 500+ patients/day without staff

Financial Services

Account balance queries, fraud alert confirmation, loan status, and appointment booking with compliance-aware design.

Compliance-aware conversation design built in

E-commerce

Order status, returns initiation, product questions, and escalation to live agents for complex issues.

68% of calls resolved without human intervention

Property Management

Maintenance request intake, rent payment queries, lease renewal scheduling, and emergency routing.

24/7 tenant support with zero overnight staffing

Automotive

Service appointment booking, recall notifications, parts availability, and dealer routing for inbound calls.

Booking conversion rate up 34% vs. hold queue

Outbound Sales

High-volume outbound campaigns, lead qualification, appointment setting, and warm transfer to closers.

1,400 concurrent outbound calls at peak

Frequently Asked Questions

How long does it take to build a voice AI agent?

The PoC is done in 5 days. A production deployment with CRM integration, telephony setup, and QA takes 2-4 weeks depending on complexity. We've shipped straightforward inbound builds in 10 days. Complex multi-tenant outbound systems have taken 6 weeks. The scoping call on day one tells you which yours will be.

Will it actually sound human, or will my customers know they're talking to AI?

With ElevenLabs voice cloning and proper VAD tuning, most callers don't flag it in blind tests. What matters more than voice quality is response latency and conversation quality. An agent that responds in under 400ms, handles interruptions gracefully, and gives useful answers reads as human even if the voice isn't perfect. The voice is only part of the story.

What's the difference between Vapi, LiveKit, and Retell AI?

Vapi deploys fastest and has the most pre-built integrations. Retell has the best out-of-the-box turn detection. LiveKit gives you full infrastructure control and is the right call for HIPAA, compliance-heavy use cases, or anything that outgrows a SaaS model. We'll recommend the right one in the first consultation.

How do you handle compliance for healthcare or financial services?

Healthcare builds run on LiveKit with end-to-end encryption. We don't use Vapi or Retell for HIPAA-sensitive call data. For financial services, we build compliance-aware conversation design with automatic flagging for regulatory trigger phrases and escalation rules. We've built this. It's not theoretical.

What happens when the AI doesn't know the answer?

We build explicit fallback paths. The agent doesn't loop "I'm sorry, I didn't understand that." It does something useful: connects to a human, schedules a callback, captures the question for follow-up, or acknowledges the gap and moves on. Fallback design is part of the conversation architecture.
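One way to picture an explicit fallback path is a function that always returns a useful next action instead of a retry loop. The inputs and the priority order here are assumptions for illustration. Real flows are designed per use case.

```python
from enum import Enum


class Fallback(Enum):
    TRANSFER_TO_HUMAN = "transfer_to_human"
    SCHEDULE_CALLBACK = "schedule_callback"
    CAPTURE_QUESTION = "capture_question"
    ACKNOWLEDGE_AND_MOVE_ON = "acknowledge_and_move_on"


def choose_fallback(blocks_call_goal: bool,
                    human_available: bool,
                    caller_accepts_callback: bool) -> Fallback:
    """Pick what the agent does when it cannot answer a question."""
    if not blocks_call_goal:
        # A side question should not derail the call.
        return Fallback.ACKNOWLEDGE_AND_MOVE_ON
    if human_available:
        return Fallback.TRANSFER_TO_HUMAN
    if caller_accepts_callback:
        return Fallback.SCHEDULE_CALLBACK
    # Nobody available and no callback wanted: log it for follow-up.
    return Fallback.CAPTURE_QUESTION
```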

Can you clone our existing brand voice?

Yes. We need 30+ minutes of clean audio from the speaker you want to clone. We use ElevenLabs Professional Voice Cloning. In blind listening tests, the result is indistinguishable from the original for the vast majority of listeners.

How does CRM sync and sentiment analysis work?

Sentiment analysis runs in real time throughout every call. At call end, the agent pushes a call summary, sentiment score, key facts captured, and full transcript to your CRM. Native integrations for Salesforce and HubSpot. Webhook flows via n8n or Zapier for everything else. If sentiment crosses a frustration threshold, the agent queues a warm transfer and the CRM record gets flagged.
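For webhook-based flows, the end-of-call push might look like the JSON body built below. The field names and the CallRecord class are hypothetical. Actual Salesforce or HubSpot integrations map these values onto each CRM's own objects.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class CallRecord:
    """What gets pushed to the CRM when a call ends (illustrative schema)."""
    call_id: str
    summary: str
    sentiment_score: float          # assumed scale: -1 (negative) to 1 (positive)
    key_facts: dict
    transcript: list
    flagged_for_followup: bool = False


def to_webhook_body(record: CallRecord) -> str:
    """Serialize a call record as a JSON string for a webhook POST."""
    return json.dumps(asdict(record), sort_keys=True)


record = CallRecord(
    call_id="call-001",
    summary="Caller rescheduled an appointment.",
    sentiment_score=-0.2,
    key_facts={"new_slot": "Tuesday 10:00"},
    transcript=["Agent: How can I help?", "Caller: I need to reschedule."],
)
body = to_webhook_body(record)
```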

What languages do you support?

20+ languages out of the box: English (all major accents), Spanish, French, German, Portuguese, Italian, Dutch, Polish, Japanese, Korean, Mandarin, Hindi, Arabic, and more. Deepgram Nova-3 handles STT for most languages. ElevenLabs covers 30+ languages for TTS.

What does 1,000+ concurrent calls mean in practice?

It means 1,000 separate live conversations at the same time. For outbound, that means dialing a list of 10,000 contacts and working through it in hours, not days. We've run load tests at 1,400 concurrent calls on LiveKit infrastructure without quality degradation. Horizontal scaling is built into the architecture from day one.
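A back-of-the-envelope estimate shows why concurrency turns days into hours. The minutes per attempt and attempts per contact below are assumed values for illustration. Real campaigns vary with ring time, connect rate, and retry policy.

```python
import math


def campaign_minutes(contacts: int, concurrency: int,
                     minutes_per_attempt: float,
                     attempts_per_contact: int = 1) -> float:
    """Estimate wall-clock minutes to work through an outbound list."""
    total_attempts = contacts * attempts_per_contact
    # Each wave fills every concurrent slot once.
    waves = math.ceil(total_attempts / concurrency)
    return waves * minutes_per_attempt


# 10,000 contacts at 1,000 concurrent calls, assuming 4 minutes per
# attempt (ring time included) and up to 3 attempts per contact.
estimate = campaign_minutes(10_000, 1_000, 4, attempts_per_contact=3)
```

Under those assumptions the list takes about two hours. The same list on a single line would take 30,000 sequential attempts.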

Can we run a PoC before committing to a full build?

Yes, this is standard. We build a working prototype in week one. You can call it, test it, and decide. If you don't move forward, you keep the prototype with no obligation. We offer this because we're confident in what we build.

Ready to Build?

Book a 30-minute call. We'll scope your use case, spec the right architecture, and have a working PoC on your desk inside a week. No slide decks. No proposals that go nowhere. Just a phone number you can call to talk to the thing we built.

Book a Free Consultation
Start Your Free PoC