By the VOCALIS AI team · Validated by Laurent Duplat, Director of Publishing at VOCALIS AI · Based on over 250 deployments since 2023
Why Latency Determines the Success of a Voice AI Agent
70% of abandoned inbound calls are caused by a response delay perceived as too long (CCW Digital study, 2024). In voice AI, the human latency budget runs from 300 to 500 ms (Stivers et al., PNAS 2009). Every millisecond gained on time-to-first-audio (TTFA) directly improves NPS and the first-contact-resolution rate.
US cloud-native platforms like Retell AI publicly report ~600 ms of orchestration latency. That level of friction is incompatible with premium use cases such as banking and insurance, healthcare, or law, where every second of silence erodes trust.
The Latency Budget Broken Down: 7 Critical Links
A voice2voice conversation goes through 7 technical steps, each with its own budget:
| Step | Target Budget (ms) | VOCALIS Technology |
|---|---|---|
| Audio Capture + Opus Encoding | 5-8 | WebRTC + Opus 20 kbps, frame 20 ms |
| SIP/RTP Transport | 10-40 | EU PoP (Paris, Frankfurt, Zurich) |
| VAD (Voice Activity Detection) | < 5 | Silero VAD + custom SLM |
| ASR Streaming | 80-120 | Whisper-large-v3 quantized INT8 on H100 |
| Partial LLM Inference | 120-180 | Fine-tuned LLM + local trigger SLM |
| TTS Streaming First Chunk | 40-50 | In-house TTS FP8 on bare-metal H100 |
| Audio Forwarding + Client Buffer | 10-20 | Adaptive RTP jitter buffer |
The cumulative total stays under 300 ms end-to-end, with a TTFA measured at sub-50 ms server-side, the core of our sub-50 ms hybrid voice AI architecture.
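As a sanity check, the table's budgets can be summed in a few lines of Python (a sketch; the step keys are our shorthand). Note that the best cases sum to 265 ms, while a naive serial sum of the worst cases exceeds 300 ms: the sub-300 ms envelope holds because the stages overlap, with ASR, LLM, and TTS streaming concurrently rather than running back-to-back.

```python
# Per-step latency budgets (ms) from the table above, as (min, max) pairs.
BUDGET_MS = {
    "capture_opus":    (5, 8),
    "sip_rtp":         (10, 40),
    "vad":             (0, 5),     # "< 5" in the table
    "asr_streaming":   (80, 120),
    "llm_partial":     (120, 180),
    "tts_first_chunk": (40, 50),
    "forward_buffer":  (10, 20),
}

def envelope(budgets):
    """Naive serial sum: (best-case, worst-case) cumulative latency in ms."""
    lo = sum(b[0] for b in budgets.values())
    hi = sum(b[1] for b in budgets.values())
    return lo, hi

lo, hi = envelope(BUDGET_MS)
print(f"serial sum: {lo}-{hi} ms")  # prints "serial sum: 265-423 ms"
```

The worst-case serial sum (423 ms) is why pipelining matters: the wall-clock total only stays under 300 ms when downstream stages start on partial upstream output.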
The Bare-Metal H100 Choice: Why Virtualization Costs 10-15% of the Budget
Each layer of abstraction introduces non-deterministic latency. KVM virtualization adds 2 to 8 ms per inference cycle according to IEEE Cloud Computing (2023). On a target TTFA of 50 ms, that’s 10 to 15% of the budget wasted before even launching TTS.
VOCALIS operates a dedicated H100 SXM bare-metal cluster, featuring:
- Real-time Linux kernel (PREEMPT_RT) patched for sub-ms determinism.
- 900 GB/s NVLink interconnect between GPUs for model sharding.
- Mellanox ConnectX-7 NIC in kernel-bypass (DPDK) for inbound RTP.
- CPU isolation via cgroups + CPU pinning, IRQ steering dedicated to audio cores.
This stack cannot be replicated on managed cloud-GPU offerings like Lambda Labs or RunPod. It is a structural capex investment that underpins our sovereign bare-metal H100 positioning, aligned with the FADP.
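The CPU-isolation idea in the last bullet can be sketched with the standard-library affinity call (Linux only). The core IDs below are hypothetical, and a real deployment pairs this with isolcpus boot flags, cgroups, and IRQ steering, none of which this sketch performs:

```python
import os

AUDIO_CORES = {2, 3}  # hypothetical cores reserved for the audio path

def pin_to_audio_cores(pid: int = 0) -> set:
    """Pin `pid` (0 = calling process) to the reserved cores so the scheduler
    never migrates it; fall back to the current mask if those cores are not
    available on this machine."""
    target = AUDIO_CORES & os.sched_getaffinity(pid) or os.sched_getaffinity(pid)
    os.sched_setaffinity(pid, target)
    return os.sched_getaffinity(pid)

print(pin_to_audio_cores())
```

Pinning alone only prevents migration; keeping other workloads *off* those cores is what the isolcpus/cgroups layer adds.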
50 ms Chunk Streaming: The Fine Mechanics
Rather than generating a complete TTS file, VOCALIS produces audio chunks of 40 ms to 50 ms that are immediately streamed to the SIP client. The in-house TTS uses:
- Distilled transformer encoder with 310 M parameters (vs 2 B teacher model).
- Modified HiFi-GAN vocoder supporting temporal chunking without phase glitches.
- CUDA FP8 pipeline with kernel fusion (FlashAttention-3).
The first chunk is emitted at T+45 ms at p50 and T+58 ms at p95. The voice starts before the LLM has even completed its full response; this is the key to conversational naturalness. The entire process fits within our 2026 voice2voice audio-to-audio approach.
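The re-framing step can be illustrated with a minimal generator (a sketch: the synthesizer stand-in and the 16 kHz sample rate are assumptions, not the in-house engine). The point is that each 50 ms chunk ships as soon as it exists, instead of waiting for the full utterance:

```python
SAMPLE_RATE = 16_000                            # Hz, assumed for illustration
CHUNK_MS = 50
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_MS // 1000  # 800 samples per 50 ms chunk

def stream_chunks(pcm_iter):
    """Re-frame an incremental PCM stream into fixed 50 ms chunks,
    yielding each one as soon as enough samples have accumulated."""
    buf = []
    for samples in pcm_iter:
        buf.extend(samples)
        while len(buf) >= CHUNK_SAMPLES:
            yield buf[:CHUNK_SAMPLES]   # ship immediately to the SIP client
            buf = buf[CHUNK_SAMPLES:]
    if buf:                             # flush the final partial chunk
        yield buf

# Usage with a fake synthesizer emitting 10 ms slivers (120 ms of silence):
fake_tts = ([0] * 160 for _ in range(12))
print([len(c) for c in stream_chunks(fake_tts)])  # prints [800, 800, 320]
```

A real pipeline would carry bytes or tensors rather than Python lists, but the framing logic is the same.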
Comparative Benchmark 2026
| Solution | Measured TTFA | End-to-End Voice2Voice Latency | Hosting |
|---|---|---|---|
| VOCALIS (target) | < 50 ms | < 300 ms | Bare-metal EU |
| Cartesia Sonic 3 TTS | 40 ms | 600-800 ms | Cloud US |
| ElevenLabs ConvAI 2.0 | 75 ms | 700-900 ms | Cloud US |
| Deepgram Aura | 150 ms | 900-1100 ms | Cloud US |
| Retell AI | ~600 ms | 1200-1500 ms | Cloud US |
| OpenAI Realtime API | 320 ms | 800-1000 ms | Cloud US |
Sources: Deepgram TTS Latency Docs, Cresta Engineering Blog, Inworld Benchmarks 2026.
Fallback and Resilience: The Invisible That Makes Production Work
A sub-50 ms system only makes sense with graceful degradation. VOCALIS implements three levels of fallback:
- Level 1 (Secondary GPU) — hot-switches to a standby node in <150 ms via NVML heartbeat.
- Level 2 (Smaller Model) — falls back to the 110 M-parameter distilled TTS if TTS p99 exceeds 80 ms.
- Level 3 (Human Handover) — transfers context plus a conversation summary to a human advisor. See the technical architecture of the voice AI chatbot.
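The three levels can be condensed into a toy decision function. The names and signature are illustrative, not the actual supervisor API; the 80 ms p99 gate comes from Level 2 above:

```python
def pick_fallback(primary_alive: bool, secondary_alive: bool, tts_p99_ms: float) -> str:
    """Toy policy mirroring the three fallback levels described above."""
    if not primary_alive and not secondary_alive:
        return "level3:human-handover"       # context + summary to an advisor
    if not primary_alive:
        return "level1:secondary-gpu"        # hot switch in <150 ms
    if tts_p99_ms > 80:
        return "level2:distilled-tts-110m"   # smaller model, lower latency
    return "nominal"

print(pick_fallback(True, True, 95.0))  # prints "level2:distilled-tts-110m"
```

The ordering matters: total GPU loss must be checked before the latency gate, otherwise a dead cluster with a high p99 would be misrouted to the smaller model.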
Compliance by Design: GDPR, AI Act, AWS EU
The bare-metal EU infrastructure, plus AWS Nitro Enclaves encryption for client keys, meets the requirements of:
- CNIL — AI / GDPR recommendations
- European AI Regulation (AI Act)
- IETF RFC 3261 — SIP
- Opus codec (RFC 6716)
Active badges: GDPR compliant · AI Act aligned · AWS EU · ISO 27001 in progress. This technical foundation is regularly validated by medical practices and banking players with the strictest requirements.
What a CTO Should Check Before Signing
- TTFA p50 and p95 figures, not just the average.
- Measurements under real load (minimum 100 concurrent calls).
- PoP location and SIP routing transparency.
- GPU inference SLA and capacity planning policy.
- Documented human handover procedure.
- Article 28 GDPR DPA signed before the POC.
For a personalized audit of your existing stack, contact the team via our contact page or directly through the dedicated onboarding.
Sub-50 ms Technical FAQ
Why is sub-50 ms latency a critical threshold in voice AI?
Natural human conversation tolerates 300 to 500 ms between the end of speech and the response (Stivers et al., PNAS 2009). Beyond 600 ms, the interlocutor perceives a robotic agent, slows their pace, and satisfaction drops. Targeting sub-50 ms time-to-first-audio (TTFA) creates the margin needed to absorb network jitter and barge-in handling.
What is the difference between TTFA and end-to-end latency?
TTFA = delay between the end of the user request and the first audio sample emitted. End-to-end latency = TTFA + network transmission duration + SIP/VoIP buffer. VOCALIS measures both independently via in-band probes triggered at each turn of speech.
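Reducing per-turn probe samples to the quoted p50/p95 figures takes only the standard library; the sample values below are invented for illustration:

```python
import statistics

def ttfa_percentiles(samples_ms):
    """Return (p50, p95) from raw per-turn TTFA samples, in ms."""
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    p95 = statistics.quantiles(samples_ms, n=100, method="inclusive")[94]
    return statistics.median(samples_ms), p95

samples = [42, 44, 45, 45, 46, 47, 48, 50, 55, 58]
p50, p95 = ttfa_percentiles(samples)
print(f"p50={p50} ms, p95={p95:.1f} ms")
```

Reporting p50 and p95 separately is exactly what the CTO checklist above asks for: an average would hide the tail that callers actually feel.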
Why H100 instead of A100 or L40S for real-time TTS?
NVIDIA H100s offer 80 GB HBM3 + native FP8 support, reducing the memory required for 2B TTS models by 40% and accelerating inference by a factor of 2.4× vs A100 (NVIDIA, Hopper whitepaper). For 50 ms chunk streaming, the HBM3 memory bandwidth eliminates pipeline stalls.
Is bare-metal really faster than a managed GPU cloud?
Yes: KVM or Firecracker virtualization adds 2-8 ms of kernel latency per inference cycle (IEEE Cloud Computing, 2023). On a TTFA budget of 50 ms, this consumes 10-15% of the margin. Dedicated bare-metal with patched real-time kernel ensures sub-millisecond determinism.
What happens if a GPU fails during a call?
The VOCALIS supervisor detects degradation in <150 ms via GPU-NVML heartbeat, switches inference to a secondary node via hot-swap gRPC, and bridges the audio without audible interruption. No audio samples are lost thanks to the 200 ms client-side circular buffer.
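The 200 ms client-side circular buffer mentioned above behaves like a bounded deque: once full, the oldest samples are evicted automatically. A minimal sketch, assuming 16 kHz audio (telephony SIP legs often run at 8 kHz, so the capacity would halve there):

```python
from collections import deque

SAMPLE_RATE = 16_000   # Hz, assumed
BUFFER_MS = 200
ring = deque(maxlen=SAMPLE_RATE * BUFFER_MS // 1000)  # 3200-sample window

def push(samples):
    """Append incoming audio; deque(maxlen=...) silently evicts the oldest
    samples once the 200 ms window is full."""
    ring.extend(samples)

push([0] * 5000)       # feed ~312 ms of audio into a 200 ms window
print(len(ring))       # prints 3200: capped at the 200 ms capacity
```

During a failover, the playout side keeps draining this window, which is what bridges the <150 ms switch without an audible gap.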
How does VOCALIS compare its figures to Cartesia Sonic or Deepgram Aura?
Cartesia Sonic 3 claims a TTFA of 40 ms on cloud TTS, Deepgram Aura 150 ms (Deepgram docs). VOCALIS targets sub-50 ms TTFA within a sub-300 ms end-to-end voice2voice pipeline, thus including ASR + LLM + TTS + VAD, by leveraging bare-metal hardware and distilled models. The benchmark is reproducible with the open-source vocalis-bench tool.
What is the carbon impact of a bare-metal H100 infrastructure?
A H100 SXM consumes 700 W TDP. VOCALIS deploys in ISO 14001 certified data centers with PUE <1.3 and liquid cooling. Energy efficiency per TTS token improves by 3.1× vs the previous generation (A100).
Also explore our technical documentation, the guide to create a voice agent, and our getting started resources.
Want to try VOCALIS AI?
Book a personalized demo and see live how our emotional voice AI transforms your conversations.
Book a demo

