Latency

How long an AI system takes to respond; often more important than raw quality for user experience.

In common use since 1960

Latency in AI engineering is how long the system takes to respond, measured from the moment a request is submitted to the first useful output (time to first token, TTFT) and to the last token of the complete output (total latency). For LLM and other generative AI products, latency is often the difference between "users love it" and "users abandon it", and tuning latency is one of the central craft skills of AI engineering in 2026.
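Both measurements can be taken on the client by timing a streamed response. The following is a minimal sketch, not a provider integration: `measure_stream` and `fake_stream` are hypothetical helper names, and `fake_stream` only simulates a model with `time.sleep` so the example runs without any API.

```python
import time
from typing import Dict, Iterable, Iterator, Optional


def measure_stream(stream: Iterable[str]) -> Dict[str, Optional[float]]:
    """Consume a token stream; record time to first token and total latency."""
    start = time.perf_counter()          # the moment the request is "submitted"
    ttft: Optional[float] = None
    tokens = 0
    for _ in stream:
        if ttft is None:
            ttft = time.perf_counter() - start   # first useful output arrived
        tokens += 1
    total = time.perf_counter() - start          # last token arrived
    return {"ttft_s": ttft, "total_s": total, "tokens": tokens}


def fake_stream(n_tokens: int, first_delay_s: float, per_token_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM response."""
    time.sleep(first_delay_s)            # simulated network + queue + prefill
    for i in range(n_tokens):
        if i:
            time.sleep(per_token_s)      # simulated per-token generation time
        yield f"tok{i}"


if __name__ == "__main__":
    print(measure_stream(fake_stream(20, first_delay_s=0.2, per_token_s=0.02)))
```

In a real product the same wrapper would sit around whatever iterator the client library returns for a streamed completion; the two numbers it reports are the ones users actually feel.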

The latency components in a typical LLM call:

  • Network round-trip — your server to the provider's API and back; tens to hundreds of milliseconds.
  • Queue time — waiting for capacity at the provider; can spike during load.
  • First-token time (TTFT) — how long until the model starts producing output. Heavily influenced by prompt length and prompt caching.
  • Token generation time — speed at which each subsequent token is generated; reported as tokens-per-second.
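  • Final latency — roughly first-token time plus (output_tokens / tokens_per_second); when measured at the client, first-token time already includes the network and queue time above.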
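The final-latency formula in the last bullet is simple enough to turn into a back-of-the-envelope estimator. This is a sketch under the assumption that generation speed is constant for the whole response; the function name and the example numbers are illustrative, not measurements.

```python
def estimate_total_latency(ttft_s: float, output_tokens: int, tokens_per_second: float) -> float:
    """Estimate end-to-end latency: TTFT plus time spent generating the output."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    if ttft_s < 0 or output_tokens < 0:
        raise ValueError("ttft_s and output_tokens must be non-negative")
    return ttft_s + output_tokens / tokens_per_second


if __name__ == "__main__":
    # Illustrative numbers: 0.5 s TTFT, 300 output tokens at 60 tokens/sec.
    total = estimate_total_latency(0.5, 300, 60)
    print(f"{total:.1f} s")  # 0.5 + 300 / 60 = 5.5 s
```

With these numbers, generation accounts for 5.0 of the 5.5 seconds, which is why shortening the output or raising tokens-per-second often matters more than shaving TTFT on long responses.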

The 2026 latency landscape:

  • Frontier APIs (OpenAI, Anthropic, Google) — typically 200ms–2s TTFT, 30–100 tokens/sec for flagship models; faster for smaller variants.
  • Groq — specialised LPU hardware achieves several hundred to thousands of tokens/sec on Llama and Mistral models; the speed leader in 2026.
  • Together, Fireworks, Replicate — competitive hosted-open-weights with 50–200 tokens/sec for typical models.
  • Self-hosted on H100/H200 — variable depending on optimisation; with vLLM or TensorRT-LLM properly tuned, 50–150 tokens/sec is typical for 70B-class models.
  • On-device — Apple Silicon Macs run 7B-class models at 30–50 tokens/sec; phones lag but are catching up.

The techniques that reduce latency:

  • Smaller models — use the smallest model that solves the task; usually delivers the biggest latency win.
  • Prompt caching — frontier APIs reuse encoded prompt prefixes; can cut TTFT by 50–90% on long shared prefixes.
  • Streaming — start showing output as it generates instead of waiting for completion; perceived latency drops dramatically.
  • Speculative decoding — small draft model proposes several tokens ahead, large model verifies them in a single parallel pass; 2–3x speedup on many workloads.
  • Quantisation — 4-bit or 8-bit weights run faster than 16-bit; quality cost is small with good recipes.
  • KV-cache reuse — keep attention state warm across follow-up turns in conversation.
  • Parallel tool calls — let the LLM call multiple tools at once instead of sequentially.
  • Caching at the application layer — semantic cache for repeated queries; can drop latency to milliseconds on a cache hit.
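As one illustration from the list above, parallel tool calls can be sketched with `asyncio.gather`. The tools here are hypothetical stand-ins that only sleep; the sketch assumes the tool calls are independent of each other, which is what makes running them concurrently safe.

```python
import asyncio
import time
from typing import Awaitable, Callable, List, Tuple

Call = Tuple[str, float]  # (tool name, simulated duration in seconds)


async def fake_tool(name: str, delay_s: float) -> str:
    """Hypothetical tool: stands in for an HTTP request or database lookup."""
    await asyncio.sleep(delay_s)
    return f"{name}: done"


async def run_sequential(calls: List[Call]) -> List[str]:
    # Each call waits for the previous one, so the durations add up.
    return [await fake_tool(name, delay) for name, delay in calls]


async def run_parallel(calls: List[Call]) -> List[str]:
    # All calls start together; wall time is roughly the slowest call.
    return list(await asyncio.gather(*(fake_tool(name, delay) for name, delay in calls)))


def timed(runner: Callable[[List[Call]], Awaitable[List[str]]], calls: List[Call]) -> Tuple[List[str], float]:
    start = time.perf_counter()
    results = asyncio.run(runner(calls))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    calls = [("search", 0.3), ("weather", 0.3), ("calendar", 0.3)]
    _, seq_s = timed(run_sequential, calls)
    _, par_s = timed(run_parallel, calls)
    print(f"sequential: {seq_s:.2f}s, parallel: {par_s:.2f}s")
```

`asyncio.gather` returns results in the order the calls were passed in, so the results can be handed back to the model in the same order it requested them.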

The latency requirements that drive design choices:

  • Real-time voice agents — under 500ms from the end of the user's speech to the first audio of the reply, across the whole pipeline (STT + LLM + TTS); demands aggressive optimisation.
  • Chat interfaces — TTFT under 1–2s for "feels responsive"; streaming required for longer outputs.
  • Search and autocomplete — complete response under 300ms; small models or aggressive caching.
  • Background agents — minutes to hours acceptable; latency rarely the bottleneck.
  • Batch processing — throughput matters more than latency; choose based on tokens-per-dollar at high concurrency.

For a US engineering team in 2026, latency is one of the central design constraints. The mature pattern is to budget latency end-to-end (network + queue + LLM + downstream + return), measure relentlessly, and optimise the longest pole first. Many AI features that feel slow are slow because of one specific bottleneck, often a synchronous tool call inside the LLM loop, that is easy to fix once identified but nearly impossible to find without measurement.
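The budgeting habit described above can be sketched as a small report over per-stage timings. The stage names and millisecond values below are made-up illustrations rather than measurements, and `longest_pole` and `over_budget_ms` are hypothetical helper names.

```python
from typing import Dict, Tuple


def longest_pole(stage_ms: Dict[str, float]) -> Tuple[str, float, float]:
    """Return (stage name, its duration in ms, its share of the total)."""
    if not stage_ms:
        raise ValueError("stage_ms must contain at least one stage")
    total = sum(stage_ms.values())
    if total <= 0:
        raise ValueError("total latency must be positive")
    name = max(stage_ms, key=lambda stage: stage_ms[stage])
    return name, stage_ms[name], stage_ms[name] / total


def over_budget_ms(stage_ms: Dict[str, float], budget_ms: float) -> float:
    """Positive result: how far over budget. Negative: remaining headroom."""
    return sum(stage_ms.values()) - budget_ms


if __name__ == "__main__":
    # Illustrative timings for one request, in milliseconds.
    measured = {"network": 80, "queue": 40, "llm": 900, "tool_call": 1500, "return": 30}
    name, ms, share = longest_pole(measured)
    print(f"longest pole: {name} ({ms} ms, {share:.0%} of total)")
    print(f"over budget by: {over_budget_ms(measured, budget_ms=2000)} ms")
```

In this made-up trace the synchronous tool call, not the model, is the stage worth fixing first, which only becomes visible once each stage is timed separately.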
