    Hestur

    How to Scale a Voice AI Agent

    Scaling a voice AI agent beyond 10,000 minutes per month requires three architectural shifts: moving to self-hosted infrastructure, implementing concurrent call handling, and adding production-grade monitoring. Without these, platform costs compound to $5,000–$16,000/month and reliability degrades under load.

    When does scale become a problem?

    Most voice AI deployments start on managed platforms (Vapi, Retell). These work well up to about 10,000 minutes/month, roughly 100 five-minute calls per business day. Above that, three problems emerge:

    • Cost: At $0.23–0.33/min, 25,000 minutes costs $5,750–$8,250/month. Self-hosted LiveKit at the same volume costs $1,750–$3,750/month.
    • Concurrency limits: Managed platforms cap concurrent calls. A spike during a marketing campaign or business hours rush can cause queue buildup.
    • Latency variance: Shared infrastructure means your latency varies with other tenants’ load.
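
The cost comparison above is simple per-minute arithmetic. The sketch below reproduces it in Python. The managed rates come from the figures quoted above; the self-hosted per-minute rates are an assumption, back-derived from the $1,750–$3,750 range at 25,000 minutes, and the function names are illustrative only.

```python
# Rates are kept in whole cents so the arithmetic stays exact.
MANAGED_CENTS = (23, 33)      # $0.23–0.33/min, as quoted above
SELF_HOSTED_CENTS = (7, 15)   # assumption: $1,750–$3,750 / 25,000 min


def monthly_cost_usd(minutes: int, cents_per_minute: int) -> float:
    """Monthly cost in USD for a given volume and per-minute rate in cents."""
    return minutes * cents_per_minute / 100


def cost_range(minutes: int, cents_range: tuple) -> tuple:
    """Return the (low, high) monthly cost in USD for a rate range."""
    low, high = cents_range
    return monthly_cost_usd(minutes, low), monthly_cost_usd(minutes, high)


if __name__ == "__main__":
    for minutes in (10_000, 25_000, 50_000):
        managed = cost_range(minutes, MANAGED_CENTS)
        self_hosted = cost_range(minutes, SELF_HOSTED_CENTS)
        print(f"{minutes:>6} min  managed ${managed[0]:,.0f}-${managed[1]:,.0f}"
              f"  self-hosted ${self_hosted[0]:,.0f}-${self_hosted[1]:,.0f}")
```

Plugging in your own contracted rates and monthly volume shows where the two cost curves start to diverge for your deployment.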

    The scale architecture: three shifts

    Shift 1: Move to self-hosted LiveKit

    LiveKit is open-source and designed for high-concurrency real-time communication. Self-hosting on a VPS or cloud instance gives you:

    • Fixed infrastructure cost regardless of call volume
    • Full concurrency control
    • Data residency for regulated industries
    • Custom pipeline modifications

    Infrastructure requirements at different scales:

    | Volume | Recommended instance | Monthly infrastructure cost |
    |---|---|---|
    | 10–25k min/mo | 4 vCPU, 8 GB RAM | $40–80 |
    | 25–100k min/mo | 8 vCPU, 16 GB RAM, load balancer | $100–200 |
    | 100k+ min/mo | Multiple workers, Redis queue, auto-scaling | $200–500 |

    Shift 2: Implement queue mode

    Queue mode separates call intake from call processing. A webhook receives the call, places it in a Redis queue, and workers pull from the queue to handle calls. This means:

    • Calls don’t fail during traffic spikes — they queue
    • Workers can auto-scale based on queue depth
    • Failed calls can be retried automatically

    Queue-based call handling architecture:

    1. Webhook receives call → pushes to Redis queue
    2. Worker pool pulls from queue → processes call
    3. Dead letter queue captures failed calls → alert + retry
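
The three steps can be sketched in a few lines of Python. This is a minimal illustration, not a production worker: an in-memory `deque` stands in for the Redis queue (in Redis, `LPUSH` and `BRPOP` on a list would play the same roles), and `CallJob`, `CallQueue`, and the retry budget `MAX_ATTEMPTS` are hypothetical names chosen for this example. It assumes a failed call is retried a few times before it lands in the dead letter queue.

```python
from collections import deque
from dataclasses import dataclass

MAX_ATTEMPTS = 3  # assumption: retry budget before a call is dead-lettered


@dataclass
class CallJob:
    call_id: str
    attempts: int = 0


class CallQueue:
    """In-memory stand-in for a Redis-backed call queue."""

    def __init__(self):
        self.pending = deque()   # main queue
        self.dead_letter = []    # calls that exhausted their retries

    def enqueue(self, job: CallJob) -> None:
        """Step 1: the webhook pushes an incoming call onto the queue."""
        self.pending.append(job)

    def process_next(self, handler) -> bool:
        """Step 2: a worker pulls one job and runs the call handler.

        Returns False when the queue is empty.
        """
        if not self.pending:
            return False
        job = self.pending.popleft()
        job.attempts += 1
        try:
            handler(job)
        except Exception:
            if job.attempts < MAX_ATTEMPTS:
                self.pending.append(job)      # requeue for another attempt
            else:
                self.dead_letter.append(job)  # step 3: capture for alerting
        return True
```

A worker loop would call `process_next` repeatedly, and an alerting job would watch `dead_letter`. Queue depth (`len(queue.pending)`) is the natural signal for auto-scaling the worker pool.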

    Shift 3: Production monitoring

    At scale, you can’t review every call. You need automated monitoring:

    • Call success rate — what percentage complete without error
    • End-to-end latency — track P50, P95, P99 response times
    • Escalation rate — how often calls go to a human (too high = agent is confused)
    • Transcript anomaly detection — flag calls where the agent went off-script
    • Cost per call — track by use case and time of day
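
As a rough illustration of the first three metrics, the sketch below computes success rate, escalation rate, and latency percentiles from a batch of call records. The `CallRecord` fields are hypothetical, and the nearest-rank percentile method is one common choice among several; a metrics backend may use a different interpolation.

```python
import math
from dataclasses import dataclass


@dataclass
class CallRecord:
    latency_ms: float   # end-to-end response latency for the call
    succeeded: bool     # completed without error
    escalated: bool     # handed off to a human


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least
    pct percent of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(calls):
    """Aggregate a non-empty batch of call records into headline metrics."""
    latencies = [c.latency_ms for c in calls]
    total = len(calls)
    return {
        "success_rate": sum(c.succeeded for c in calls) / total,
        "escalation_rate": sum(c.escalated for c in calls) / total,
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
    }
```

Running `summarize` over a sliding window (for example, the last hour of calls) and alerting when a value crosses a threshold gives a basic automated monitor.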

    Provider-level scaling considerations

    Speech-to-text at scale

    Deepgram handles concurrency well; raising your concurrency limit is usually a plan upgrade, not an architecture change. At 25,000+ minutes/month, request a dedicated endpoint to secure a throughput SLA.

    LLM inference at scale

    OpenAI and Anthropic both impose rate limits by tier, so check your limits early. At 100 concurrent calls, each making 3–4 LLM requests per minute, you generate 300–400 requests per minute in total and need a Tier 4+ OpenAI account. Alternatively, run a local LLM on your own infrastructure for high-volume, routine calls and use GPT-4o only for complex intents.
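
The sizing arithmetic and the split between a local model and GPT-4o can be sketched as follows. The intent labels, the `"local-llm"` model name, and the set of "complex" intents are assumptions made for illustration; the actual rate limit for your account should be read from your provider's dashboard.

```python
def required_rpm(concurrent_calls: int, requests_per_call_per_minute: int) -> int:
    """Total LLM requests per minute generated by a set of concurrent calls."""
    return concurrent_calls * requests_per_call_per_minute


def fits_limit(rpm: int, limit_rpm: int) -> bool:
    """True if the expected load stays within the account's rate limit."""
    return rpm <= limit_rpm


# Hypothetical intent labels that warrant the larger hosted model.
COMPLEX_INTENTS = {"billing_dispute", "complaint"}


def pick_model(intent: str) -> str:
    """Send complex intents to GPT-4o and everything else to a local model."""
    return "gpt-4o" if intent in COMPLEX_INTENTS else "local-llm"


if __name__ == "__main__":
    low, high = required_rpm(100, 3), required_rpm(100, 4)
    print(f"Expected load: {low}-{high} requests/min")
    print(pick_model("scheduling"), pick_model("billing_dispute"))
```

Routing by intent only reduces load on the hosted model if most traffic is routine, so it is worth measuring your intent mix before relying on it.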

    TTS at scale

    Cartesia and ElevenLabs both support high concurrency. For very high volume, pre-generate common response phrases as cached audio files: “I’ll check availability for you” doesn’t need to be synthesized fresh on every call.
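
A phrase cache of this kind can be sketched as a small wrapper around whatever synthesis call you use. In the example below, `PhraseCache` and `fake_synthesize` are hypothetical: the fake function stands in for a real TTS provider call, and the cache lives in memory, whereas a deployment would more likely store audio files on disk or in object storage.

```python
import hashlib


class PhraseCache:
    """Caches synthesized audio keyed by voice and normalized text."""

    def __init__(self, synthesize):
        self._synthesize = synthesize  # callable: (voice, text) -> bytes
        self._audio = {}
        self.misses = 0                # how often synthesis actually ran

    @staticmethod
    def _key(voice: str, text: str) -> str:
        # Normalize case and whitespace so trivial variations share an entry.
        normalized = " ".join(text.lower().split())
        return hashlib.sha256(f"{voice}|{normalized}".encode()).hexdigest()

    def get_audio(self, voice: str, text: str) -> bytes:
        key = self._key(voice, text)
        if key not in self._audio:
            self.misses += 1
            self._audio[key] = self._synthesize(voice, text)
        return self._audio[key]


def fake_synthesize(voice: str, text: str) -> bytes:
    """Stand-in for a real TTS request; returns placeholder bytes."""
    return f"{voice}:{text}".encode()
```

Only fixed phrases benefit from this; responses that contain caller-specific details such as names or times still need live synthesis.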

    Multi-agent architecture for high complexity

    At scale, a single agent often isn’t enough. High-volume deployments commonly use a supervisor pattern:

    • Router agent — handles call intake, identifies intent, routes to specialist agent
    • Specialist agents — one each for scheduling, billing, support, escalation
    • Supervisor — monitors the conversation, can override specialist, handles handoffs

    This pattern is particularly effective for contact centres where callers have diverse intents and a single agent can’t be expert in everything.
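
The control flow of the pattern can be sketched without any LLM at all. In the example below, keyword matching stands in for real intent classification, the specialists are placeholder functions, and the supervisor's override rule (escalate after two failed turns) is an assumption chosen for illustration.

```python
# Hypothetical keyword lists; a real router would use an LLM or classifier.
KEYWORDS = {
    "scheduling": ("appointment", "book", "reschedule"),
    "billing": ("invoice", "charge", "refund"),
    "support": ("broken", "error", "not working"),
}


def route(utterance: str) -> str:
    """Router agent: map an utterance to a specialist by keyword."""
    text = utterance.lower()
    for intent, words in KEYWORDS.items():
        if any(word in text for word in words):
            return intent
    return "escalation"  # no match: hand off to a human


def supervise(intent: str, failed_turns: int, max_failed_turns: int = 2) -> str:
    """Supervisor: override the specialist once too many turns have failed."""
    return "escalation" if failed_turns >= max_failed_turns else intent


def make_specialist(name: str):
    """Build a placeholder specialist that tags its reply with its name."""
    return lambda utterance: f"[{name}] handling: {utterance}"


SPECIALISTS = {
    name: make_specialist(name)
    for name in ("scheduling", "billing", "support", "escalation")
}


def handle_turn(utterance: str, failed_turns: int = 0) -> str:
    """Route one caller turn, apply the supervisor check, and dispatch."""
    specialist = supervise(route(utterance), failed_turns)
    return SPECIALISTS[specialist](utterance)
```

For example, `handle_turn("I need to book an appointment")` is dispatched to the scheduling specialist, while the same utterance with `failed_turns=2` is overridden to escalation.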

    Performance benchmarks to target

    | Metric | Acceptable | Good | Excellent |
    |---|---|---|---|
    | End-to-end latency | Under 800 ms | Under 500 ms | Under 300 ms |
    | STT accuracy | 90%+ | 95%+ | 98%+ |
    | Call success rate | 90%+ | 95%+ | 99%+ |
    | Escalation rate | Under 30% | Under 20% | Under 10% |
    | First call resolution | 60%+ | 75%+ | 85%+ |
