Scaling a voice AI agent beyond 10,000 minutes per month requires three architectural shifts: moving to self-hosted infrastructure, implementing concurrent call handling, and adding production-grade monitoring. Without these, platform costs climb to $5,000–$16,000/month and reliability degrades under load.
When does scale become a problem?
Most voice AI deployments start on managed platforms (Vapi, Retell). These work well up to about 10,000 minutes/month, which is roughly 65–70 five-minute calls per day. Above that:
- Cost: At $0.23–0.33/min, 25,000 minutes costs $5,750–$8,250/month. Self-hosted LiveKit at the same volume costs $1,750–$3,750/month.
- Concurrency limits: Managed platforms cap concurrent calls. A spike during a marketing campaign or a business-hours rush can cause queue buildup.
- Latency variance: Shared infrastructure means your latency varies with other tenants’ load.
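The cost figures above are plain per-minute arithmetic, and it can help to rerun them for your own volume. The following is a minimal sketch that uses only the numbers quoted in this section; the function names are illustrative, and the self-hosted per-minute rate is simply derived from the quoted monthly range rather than taken from any published price list.

```python
def monthly_cost(minutes: int, rate_low: float, rate_high: float) -> tuple[float, float]:
    """Return the (low, high) monthly cost in dollars for a per-minute price range."""
    return (round(minutes * rate_low, 2), round(minutes * rate_high, 2))


def effective_rate(monthly_low: float, monthly_high: float, minutes: int) -> tuple[float, float]:
    """Convert a monthly cost range back into an effective $/min range."""
    return (round(monthly_low / minutes, 4), round(monthly_high / minutes, 4))


managed = monthly_cost(25_000, 0.23, 0.33)          # managed platform pricing quoted above
self_hosted = effective_rate(1_750, 3_750, 25_000)  # implied by the self-hosted range above

print(f"Managed at 25,000 min: ${managed[0]:,.0f} to ${managed[1]:,.0f} per month")
print(f"Self-hosted effective rate: ${self_hosted[0]:.2f} to ${self_hosted[1]:.2f} per minute")
```

Swapping in your own monthly minutes shows where the gap between the two options becomes large enough to justify a migration.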
The scale architecture: three shifts
Shift 1: Move to self-hosted LiveKit
LiveKit is open-source and designed for high-concurrency real-time communication. Self-hosting on a VPS or cloud instance gives you:
- Fixed infrastructure cost regardless of call volume
- Full concurrency control
- Data residency for regulated industries
- Custom pipeline modifications
Infrastructure requirements at different scales:
| Volume | Recommended instance | Monthly infrastructure cost |
|---|---|---|
| 10–25k min/mo | 4 vCPU, 8 GB RAM | $40–80 |
| 25–100k min/mo | 8 vCPU, 16 GB RAM, load balancer | $100–200 |
| 100k+ min/mo | Multiple workers, Redis queue, auto-scaling | $200–500 |
Shift 2: Implement queue mode
Queue mode separates call intake from call processing. A webhook receives the call, places it in a Redis queue, and workers pull from the queue to handle calls. This means:
- Calls don’t fail during traffic spikes — they queue
- Workers can auto-scale based on queue depth
- Failed calls can be retried automatically
Queue-based call handling architecture
1. Webhook receives call → pushes to Redis queue
2. Worker pool pulls from queue → processes call
3. Dead letter queue captures failed calls → alert + retry
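The three steps above can be sketched in a few lines. This is a simplified illustration, not a production implementation: an in-memory `deque` stands in for the Redis queue so the example runs without a server, and `CallQueue`, `MAX_ATTEMPTS`, and `flaky_handler` are hypothetical names introduced here. In a real deployment the queue would live in Redis and the dead letter list would trigger an alert.

```python
from collections import deque
from dataclasses import dataclass

MAX_ATTEMPTS = 3  # assumed retry budget, not a figure from this article


@dataclass
class Call:
    call_id: str
    attempts: int = 0


class CallQueue:
    """In-memory stand-in for the Redis queue plus a dead letter queue."""

    def __init__(self) -> None:
        self.pending: deque[Call] = deque()
        self.dead_letter: list[Call] = []
        self.completed: list[str] = []

    def enqueue(self, call_id: str) -> None:
        """Step 1: the webhook only enqueues, so intake stays fast during spikes."""
        self.pending.append(Call(call_id))

    def work_once(self, handler) -> bool:
        """Step 2: a worker pulls one call and processes it. Returns False when idle."""
        if not self.pending:
            return False
        call = self.pending.popleft()
        call.attempts += 1
        try:
            handler(call)
            self.completed.append(call.call_id)
        except Exception:
            if call.attempts < MAX_ATTEMPTS:
                self.pending.append(call)      # put it back for a later retry
            else:
                self.dead_letter.append(call)  # step 3: park it for alerting
        return True


def flaky_handler(call: Call) -> None:
    """Hypothetical handler: 'bad' always fails, 'slow' fails on its first attempt."""
    if call.call_id == "bad" or (call.call_id == "slow" and call.attempts == 1):
        raise RuntimeError("pipeline error")


queue = CallQueue()
for call_id in ["a", "slow", "bad"]:
    queue.enqueue(call_id)
while queue.work_once(flaky_handler):
    pass
```

After the loop drains, `queue.completed` holds the calls that eventually succeeded and `queue.dead_letter` holds the one that exhausted its retries. Auto-scaling on queue depth would watch `len(queue.pending)` (or the Redis list length) and add workers when it grows.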
Shift 3: Production monitoring
At scale, you can’t review every call. You need automated monitoring:
- Call success rate — what percentage complete without error
- End-to-end latency — track P50, P95, P99 response times
- Escalation rate — how often calls go to a human (too high = agent is confused)
- Transcript anomaly detection — flag calls where the agent went off-script
- Cost per call — track by use case and time of day
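Most of these metrics reduce to simple aggregates over call records. The sketch below assumes a hypothetical `CallRecord` shape (your call logs will have their own fields) and uses Python's standard `statistics.quantiles` for the latency percentiles. Transcript anomaly detection is left out because it depends on the agent's script.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class CallRecord:
    succeeded: bool     # call completed without error
    escalated: bool     # call was handed to a human
    latency_ms: float   # end-to-end response latency
    cost_usd: float     # total provider cost for the call


def summarize(calls: list[CallRecord]) -> dict[str, float]:
    """Aggregate a batch of call records (at least two) into dashboard metrics."""
    n = len(calls)
    # quantiles(n=100) returns 99 cut points: index 49 is P50, 94 is P95, 98 is P99.
    cuts = quantiles([c.latency_ms for c in calls], n=100, method="inclusive")
    return {
        "success_rate": sum(c.succeeded for c in calls) / n,
        "escalation_rate": sum(c.escalated for c in calls) / n,
        "latency_p50_ms": cuts[49],
        "latency_p95_ms": cuts[94],
        "latency_p99_ms": cuts[98],
        "cost_per_call_usd": sum(c.cost_usd for c in calls) / n,
    }
```

To track cost by use case or time of day, group the records by that field first and call `summarize` on each group.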
Provider-level scaling considerations
Speech-to-text at scale
Deepgram handles concurrency well with a simple API key upgrade. At 25,000+ minutes/month, request a dedicated endpoint so that throughput is covered by an SLA.
LLM inference at scale
OpenAI and Anthropic both impose rate limits by tier. Check your rate limits early: 100 concurrent calls, each making 3–4 LLM requests per minute, add up to 300–400 requests per minute, which calls for a Tier 4+ OpenAI account. Alternatively, run a local LLM on your own infrastructure for high-volume calls and use GPT-4o only for complex intents.
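A rough capacity check and a routing rule can be sketched as follows. All names here are illustrative: the intent labels are hypothetical, `"local-llm"` is a placeholder for whatever model you host, and the rate limit is passed in as a parameter because actual limits depend on your account and should be read from your provider dashboard.

```python
def required_rpm(concurrent_calls: int, requests_per_call_per_min: float) -> float:
    """Aggregate LLM requests per minute across all live calls."""
    return concurrent_calls * requests_per_call_per_min


def fits_limit(rpm: float, limit_rpm: float, headroom: float = 0.8) -> bool:
    """True if the load stays within a safety fraction of the account's limit."""
    return rpm <= limit_rpm * headroom


# Hypothetical intent labels; a real deployment would define its own.
COMPLEX_INTENTS = {"billing_dispute", "contract_change"}


def pick_model(intent: str) -> str:
    """Send complex intents to GPT-4o and everything else to the local model."""
    return "gpt-4o" if intent in COMPLEX_INTENTS else "local-llm"


low, high = required_rpm(100, 3), required_rpm(100, 4)
print(f"Peak load: {low} to {high} requests/min")
```

The `headroom` default of 0.8 is an assumption that leaves a 20% margin for bursts; tune it to how spiky your traffic is.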
TTS at scale
Cartesia and ElevenLabs both support high concurrency. For very high volume, generate common response phrases once and serve them as pre-cached audio files: “I’ll check availability for you” doesn’t need to be generated fresh on every call.
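One way to implement phrase caching is to key the audio by a hash of the normalized text. The sketch below is provider-agnostic: `synthesize` is a hypothetical callable you would back with your TTS provider's client, and `PhraseCache` is an illustrative name. Normalizing case and whitespace is an assumption that holds only if your TTS voice renders those variants identically.

```python
import hashlib
from typing import Callable


class PhraseCache:
    """Cache synthesized audio so repeated phrases are generated only once."""

    def __init__(self, synthesize: Callable[[str], bytes]) -> None:
        self._synthesize = synthesize  # hypothetical TTS call supplied by the caller
        self._audio: dict[str, bytes] = {}
        self.misses = 0

    @staticmethod
    def _key(text: str) -> str:
        # Collapse whitespace and case so trivially different strings share audio.
        normalized = " ".join(text.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, text: str) -> bytes:
        key = self._key(text)
        if key not in self._audio:
            self.misses += 1
            self._audio[key] = self._synthesize(text)
        return self._audio[key]


def fake_tts(text: str) -> bytes:
    """Stand-in for a real TTS request; returns placeholder bytes."""
    return text.encode("utf-8")


cache = PhraseCache(fake_tts)
first = cache.get("I'll check availability for you")
second = cache.get("i'll  check availability for you")  # served from the cache
```

In production the dictionary would typically be replaced by files on disk or object storage, warmed at deploy time with the most frequent phrases.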
Multi-agent architecture for high complexity
At scale, a single agent often isn’t enough. High-volume deployments commonly use a supervisor pattern:
- Router agent — handles call intake, identifies intent, routes to specialist agent
- Specialist agents — one each for scheduling, billing, support, escalation
- Supervisor — monitors the conversation, can override specialist, handles handoffs
This pattern is particularly effective for contact centres where callers have diverse intents and a single agent can’t be expert in everything.
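The control flow of the supervisor pattern can be illustrated with a toy example. Everything here is a stand-in: keyword matching replaces what would normally be an LLM intent classifier, the specialists are one-line functions rather than full agents, and the override policy (escalate on request or after two failed turns) is an assumption chosen for the sketch.

```python
from typing import Callable

# Specialist agents: each takes the caller's utterance and returns a reply.
# Real specialists would wrap an LLM with its own prompt and tools.
SPECIALISTS: dict[str, Callable[[str], str]] = {
    "scheduling": lambda text: "scheduling agent handling: " + text,
    "billing": lambda text: "billing agent handling: " + text,
    "support": lambda text: "support agent handling: " + text,
    "escalation": lambda text: "transferring to a human",
}

# Hypothetical keyword lists standing in for an LLM intent classifier.
INTENT_KEYWORDS = {
    "scheduling": ("appointment", "reschedule", "book"),
    "billing": ("invoice", "charge", "refund"),
}


def route(utterance: str) -> str:
    """Router agent: identify the intent and name the specialist to use."""
    text = utterance.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(word in text for word in keywords):
            return intent
    return "support"  # default when no keyword matches


def supervise(utterance: str, failed_turns: int) -> str:
    """Supervisor: override the router when the conversation is going badly."""
    wants_human = "human" in utterance.lower()
    if wants_human or failed_turns >= 2:  # assumed override policy
        return "escalation"
    return route(utterance)


def handle(utterance: str, failed_turns: int = 0) -> str:
    """Run one turn: the supervisor picks the specialist, which then replies."""
    return SPECIALISTS[supervise(utterance, failed_turns)](utterance)
```

For example, `handle("I need to book an appointment")` goes to the scheduling specialist, while `handle("Let me talk to a human")` is overridden by the supervisor and escalated.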
Performance benchmarks to target
| Metric | Acceptable | Good | Excellent |
|---|---|---|---|
| End-to-end latency | Under 800ms | Under 500ms | Under 300ms |
| STT accuracy | 90%+ | 95%+ | 98%+ |
| Call success rate | 90%+ | 95%+ | 99%+ |
| Escalation rate | Under 30% | Under 20% | Under 10% |
| First call resolution | 60%+ | 75%+ | 85%+ |
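The benchmark table can be turned into an automated check that grades each measured metric. This is a minimal sketch: the metric keys are names chosen here, rates are expressed as fractions, "Under X" is read as strictly less than X, and "X%+" as greater than or equal to X.

```python
# metric: (acceptable, good, excellent, higher_is_better), taken from the table above
THRESHOLDS = {
    "latency_ms": (800, 500, 300, False),
    "stt_accuracy": (0.90, 0.95, 0.98, True),
    "call_success_rate": (0.90, 0.95, 0.99, True),
    "escalation_rate": (0.30, 0.20, 0.10, False),
    "first_call_resolution": (0.60, 0.75, 0.85, True),
}


def grade(metric: str, value: float) -> str:
    """Return the best tier the value reaches, or 'below target' if none."""
    acceptable, good, excellent, higher_is_better = THRESHOLDS[metric]
    tiers = (("excellent", excellent), ("good", good), ("acceptable", acceptable))
    for label, bound in tiers:
        passed = value >= bound if higher_is_better else value < bound
        if passed:
            return label
    return "below target"


print(grade("latency_ms", 450))        # good
print(grade("escalation_rate", 0.05))  # excellent
```

Running `grade` over the output of your monitoring job gives a quick per-metric status that can feed alerts when a metric drops a tier.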