How to Scale a Voice AI Agent

Scaling a voice AI agent beyond 10,000 minutes per month requires three architectural shifts: moving to self-hosted infrastructure, implementing concurrent call handling, and adding production-grade monitoring. Without these, managed-platform costs climb to $5,000–$16,000/month and reliability degrades under load.

When does scale become a problem?

Most voice AI deployments start on managed platforms such as Vapi or Retell. These work well up to about 10,000 minutes/month, roughly 65–70 five-minute calls per day. Above that, three problems appear:

Cost: At $0.23–0.33/min, 25,000 minutes costs $5,750–$8,250/month. Self-hosted LiveKit at the same volume costs $1,750–$3,750/month, which works out to about $0.07–0.15/min.
Concurrency limits: Managed platforms cap concurrent calls. A spike during a marketing campaign or a business-hours rush can cause calls to back up in a queue.
Latency variance: Shared infrastructure means your latency varies with other tenants’ load.
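
The cost comparison above is simple arithmetic, and it can be reproduced with a few lines of Python. This is a minimal sketch: the per-minute rates and the self-hosted range are the figures quoted in this article, while the function and variable names are illustrative.

```python
MINUTES_PER_MONTH = 25_000

def cost_range(minutes: int, low_rate: float, high_rate: float) -> tuple[float, float]:
    """Return the (low, high) monthly cost in USD for a per-minute price range."""
    return (round(minutes * low_rate, 2), round(minutes * high_rate, 2))

# Managed platform pricing quoted in the article: $0.23-0.33 per minute.
managed_low, managed_high = cost_range(MINUTES_PER_MONTH, 0.23, 0.33)

# Self-hosted range quoted in the article for the same volume (not derived here).
self_hosted_low, self_hosted_high = 1_750.0, 3_750.0

# Worst case compares the cheapest managed bill with the priciest self-hosted one.
savings_min = managed_low - self_hosted_high
savings_max = managed_high - self_hosted_low

print(f"Managed:     ${managed_low:,.0f}-${managed_high:,.0f}/month")
print(f"Self-hosted: ${self_hosted_low:,.0f}-${self_hosted_high:,.0f}/month")
print(f"Savings:     ${savings_min:,.0f}-${savings_max:,.0f}/month")
```

Swapping in your own minute volume and quoted rates gives a quick estimate of where self-hosting starts to pay for itself.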

The scale architecture: three shifts

Shift 1: Move to self-hosted LiveKit

LiveKit is open-source and designed for high-concurrency real-time communication. Self-hosting on a VPS or cloud instance gives you:

Infrastructure cost that is fixed per instance size rather than billed per minute
Full concurrency control
Data residency for regulated industries
Custom pipeline modifications

Infrastructure requirements at different scales:

Volume	Recommended setup	Monthly infrastructure cost
—	—	—
10–25k min/mo	4 vCPU, 8 GB RAM	$40–80
25–100k min/mo	8 vCPU, 16 GB RAM, load balancer	$100–200
100k+ min/mo	Multiple workers, Redis queue, auto-scaling	$200–500

Shift 2: Implement queue mode

Queue mode separates call intake from call processing. A webhook receives the call, places it in a Redis queue, and workers pull from the queue to handle calls. This means:

Calls don’t fail during traffic spikes — they queue
Workers can auto-scale based on queue depth
Failed calls can be retried automatically
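
The intake, queue, and worker split can be sketched in plain Python. This sketch assumes an in-process `queue.Queue` as a stand-in for Redis so that it runs without a server; `handle_call`, `MAX_ATTEMPTS`, and the call dictionary layout are hypothetical names chosen for illustration, not part of LiveKit or any platform API.

```python
import queue
import threading

MAX_ATTEMPTS = 3  # hypothetical retry budget per call

def handle_call(call: dict) -> None:
    """Placeholder for the real voice pipeline.

    Calls flagged as flaky fail on their first attempt to simulate a
    transient error.
    """
    if call["flaky"] and call["attempts"] == 1:
        raise RuntimeError("simulated transient failure")

def intake(call_queue: queue.Queue, call_id: str, flaky: bool = False) -> None:
    """Webhook side: enqueue the call and return immediately."""
    # With Redis, this would be a list push such as LPUSH.
    call_queue.put({"id": call_id, "attempts": 0, "flaky": flaky})

def worker(call_queue: queue.Queue, completed: list, failed: list) -> None:
    """Worker side: pull calls, process them, and re-queue failures."""
    while True:
        call = call_queue.get()  # with Redis, a blocking pop such as BRPOP
        if call is None:  # sentinel value: shut this worker down
            call_queue.task_done()
            return
        call["attempts"] += 1
        try:
            handle_call(call)
            completed.append(call["id"])
        except Exception:
            if call["attempts"] < MAX_ATTEMPTS:
                call_queue.put(call)  # automatic retry
            else:
                failed.append(call["id"])
        finally:
            call_queue.task_done()

def run_demo(num_workers: int = 2) -> tuple[list, list]:
    call_queue: queue.Queue = queue.Queue()
    completed: list = []
    failed: list = []
    for i in range(5):
        intake(call_queue, f"call-{i}", flaky=(i == 2))
    threads = [
        threading.Thread(target=worker, args=(call_queue, completed, failed))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    call_queue.join()  # wait for every queued call, including retries
    for _ in threads:
        call_queue.put(None)
    for t in threads:
        t.join()
    return sorted(completed), failed

if __name__ == "__main__":
    print(run_demo())
```

In a real deployment the queue depth (the number of waiting items) is the signal an auto-scaler would watch to add or remove workers.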

Shift 3: Production monitoring

At scale, you can’t review every call. You need automated monitoring:

Call success rate — what percentage complete without error
End-to-end latency — track P50, P95, P99 response times
Escalation rate — how often calls go to a human (too high = agent is confused)
Transcript anomaly detection — flag calls where the agent went off-script
Cost per call — track by use case and time of day
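
Most of these metrics reduce to simple aggregations over per-call records. The sketch below assumes a hypothetical `CallRecord` shape and uses the nearest-rank method for percentiles; transcript anomaly detection is left out because it depends on the agent's script and tooling.

```python
import math
from dataclasses import dataclass

@dataclass
class CallRecord:
    """Hypothetical per-call log entry; field names are illustrative."""
    succeeded: bool
    escalated: bool
    latency_ms: float
    cost_usd: float

def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def summarize(calls: list[CallRecord]) -> dict[str, float]:
    """Aggregate the monitoring metrics for a non-empty batch of calls."""
    n = len(calls)
    latencies = [c.latency_ms for c in calls]
    return {
        "success_rate": sum(c.succeeded for c in calls) / n,
        "escalation_rate": sum(c.escalated for c in calls) / n,
        "latency_p50_ms": percentile(latencies, 50),
        "latency_p95_ms": percentile(latencies, 95),
        "latency_p99_ms": percentile(latencies, 99),
        "cost_per_call_usd": sum(c.cost_usd for c in calls) / n,
    }

# Synthetic sample: ten calls, one failure, two escalations.
sample = [
    CallRecord(succeeded=(i != 0), escalated=(i < 2),
               latency_ms=100.0 * (i + 1), cost_usd=0.5)
    for i in range(10)
]
summary = summarize(sample)
print(summary)
```

Grouping the records by use case or hour of day before calling `summarize` gives the per-segment cost breakdown mentioned in the list.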

Provider-level scaling considerations

Speech-to-text at scale

Deepgram handles concurrency well, and raising the limit is a simple upgrade on your API key. At 25,000+ minutes/month, request a dedicated endpoint backed by a throughput SLA.

LLM inference at scale

OpenAI and Anthropic both impose rate limits by tier. Check your limits early: 100 concurrent calls each making 3–4 LLM requests per minute add up to 300–400 requests per minute, before token limits are counted, and in practice that calls for a Tier 4+ OpenAI account. Alternatively, run a local LLM on your own infrastructure for the high-volume calls and reserve GPT-4o for complex intents.
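
The rate-limit check and the local-versus-hosted split can be sketched as follows. The limit passed to `fits_limit` is a placeholder to be replaced with the figure from your provider dashboard, and the intent names, the `headroom` factor, and the `local-llm` label are assumptions made for illustration.

```python
def required_rpm(concurrent_calls: int, requests_per_call_per_min: float) -> float:
    """Total LLM requests per minute across all active calls."""
    return concurrent_calls * requests_per_call_per_min

def fits_limit(concurrent_calls: int, requests_per_call_per_min: float,
               rpm_limit: float, headroom: float = 0.8) -> bool:
    """True if peak demand stays within a safety margin of the provider limit."""
    demand = required_rpm(concurrent_calls, requests_per_call_per_min)
    return demand <= rpm_limit * headroom

# Hypothetical set of intents considered complex enough for the hosted model.
COMPLEX_INTENTS = {"billing_dispute", "cancellation"}

def pick_model(intent: str) -> str:
    """Send complex intents to GPT-4o and everything else to a local model."""
    return "gpt-4o" if intent in COMPLEX_INTENTS else "local-llm"

print(required_rpm(100, 3), required_rpm(100, 4))
print(pick_model("appointment_booking"), pick_model("billing_dispute"))
```

Tokens per minute usually need the same check, since a tier can pass on request count and still fail on token volume.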

TTS at scale

Cartesia and ElevenLabs both support high concurrency. For very high volume, generate common response phrases as pre-cached audio files: “I’ll check availability for you” doesn’t need to be generated fresh on every call.
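
A phrase cache of this kind can be sketched as a small wrapper around whatever synthesis call you use. The `synthesize` callable here is a hypothetical stand-in for a provider SDK call, and the file naming scheme is an assumption; the demo uses a fake synthesizer so it runs without network access.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable

class PhraseCache:
    """Stores synthesized audio on disk, keyed by voice and exact phrase text."""

    def __init__(self, cache_dir: Path, synthesize: Callable[[str, str], bytes]):
        self.cache_dir = cache_dir
        self.synthesize = synthesize  # hypothetical: (voice, text) -> audio bytes

    def _path_for(self, voice: str, text: str) -> Path:
        key = hashlib.sha256(f"{voice}\n{text.strip()}".encode("utf-8")).hexdigest()
        return self.cache_dir / f"{key}.audio"

    def get_audio(self, voice: str, text: str) -> bytes:
        path = self._path_for(voice, text)
        if path.exists():
            return path.read_bytes()  # cache hit: no provider call
        audio = self.synthesize(voice, text)
        path.write_bytes(audio)
        return audio

def demo() -> int:
    """Request the same phrase twice and return how many times synthesis ran."""
    synth_calls = []

    def fake_synthesize(voice: str, text: str) -> bytes:
        synth_calls.append(text)
        return text.encode("utf-8")  # placeholder bytes, not real audio

    with tempfile.TemporaryDirectory() as tmp:
        cache = PhraseCache(Path(tmp), fake_synthesize)
        first = cache.get_audio("voice-a", "I'll check availability for you")
        second = cache.get_audio("voice-a", "I'll check availability for you")
        assert first == second
    return len(synth_calls)

if __name__ == "__main__":
    print(demo())
```

Including the voice in the cache key keeps audio for one voice from being served when another is configured.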

Multi-agent architecture for high complexity

At scale, a single agent often isn’t enough. High-volume deployments commonly use a supervisor pattern:

Router agent — handles call intake, identifies intent, routes to specialist agent
Specialist agents — one each for scheduling, billing, support, escalation
Supervisor — monitors the conversation, can override specialist, handles handoffs

This pattern is particularly effective for contact centres where callers have diverse intents and a single agent can’t be expert in everything.
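
The three roles can be illustrated with a deliberately simple sketch. Keyword matching stands in for real intent classification, and every function name, keyword list, and reply string below is hypothetical; a production system would typically use an LLM or classifier for routing.

```python
from typing import Callable

# Specialist agents: each takes the caller's utterance and returns a reply.
def scheduling_agent(utterance: str) -> str:
    return "Let me look at the calendar for you."

def billing_agent(utterance: str) -> str:
    return "I can help with that charge."

def support_agent(utterance: str) -> str:
    return "Let's troubleshoot that together."

def escalation_agent(utterance: str) -> str:
    return "I'm connecting you with a team member."

SPECIALISTS: dict[str, Callable[[str], str]] = {
    "scheduling": scheduling_agent,
    "billing": billing_agent,
    "support": support_agent,
    "escalation": escalation_agent,
}

# Illustrative keyword lists standing in for an intent model.
INTENT_KEYWORDS = {
    "scheduling": ("appointment", "book", "reschedule"),
    "billing": ("invoice", "charge", "refund"),
    "support": ("broken", "error", "not working"),
}
HUMAN_REQUEST_KEYWORDS = ("human", "representative", "real person")

def route(utterance: str) -> str:
    """Router agent: map an utterance to a specialist, defaulting to escalation."""
    text = utterance.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(word in text for word in keywords):
            return intent
    return "escalation"

def supervisor_override(utterance: str) -> bool:
    """Supervisor: force a handoff when the caller asks for a person."""
    text = utterance.lower()
    return any(word in text for word in HUMAN_REQUEST_KEYWORDS)

def handle_turn(utterance: str) -> tuple[str, str]:
    intent = route(utterance)
    if supervisor_override(utterance):
        intent = "escalation"
    return intent, SPECIALISTS[intent](utterance)

print(handle_turn("I need to book an appointment"))
```

Keeping the override in a separate function means the escalation policy can change without touching any specialist.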

Performance benchmarks to target

Metric	Acceptable	Good	Excellent
—	—	—	—
End-to-end latency	Under 800ms	Under 500ms	Under 300ms
STT accuracy	90%+	95%+	98%+
Call success rate	90%+	95%+	99%+
Escalation rate	Under 30%	Under 20%	Under 10%
First call resolution	60%+	75%+	85%+
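
The table can be turned into an automated check that labels each measured value. This is a minimal sketch: the thresholds are copied from the table, while the metric keys and the "below target" label are illustrative. "Under X" is treated as a strict bound and "X%+" as inclusive.

```python
# (direction, acceptable, good, excellent); "lower" means smaller is better.
BENCHMARKS = {
    "end_to_end_latency_ms": ("lower", 800, 500, 300),
    "stt_accuracy_pct": ("higher", 90, 95, 98),
    "call_success_rate_pct": ("higher", 90, 95, 99),
    "escalation_rate_pct": ("lower", 30, 20, 10),
    "first_call_resolution_pct": ("higher", 60, 75, 85),
}

def grade(metric: str, value: float) -> str:
    """Return the best benchmark band that the measured value reaches."""
    direction, acceptable, good, excellent = BENCHMARKS[metric]

    def meets(threshold: float) -> bool:
        if direction == "lower":
            return value < threshold   # "Under X"
        return value >= threshold      # "X%+"

    if meets(excellent):
        return "excellent"
    if meets(good):
        return "good"
    if meets(acceptable):
        return "acceptable"
    return "below target"

print(grade("end_to_end_latency_ms", 450))
print(grade("escalation_rate_pct", 25))
```

Running `grade` over each monitoring window makes it easy to alert when any metric drops out of its target band.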