Reducing Latency in LiveKit Applications: A Complete Guide

23 April 2026 · Mykola Kozak

Data transfer in real-time communication is expected to be almost instantaneous. The problem is that on voice, video, or data streaming platforms, even a small delay can completely disrupt the user experience. The gap between when a user speaks and when they hear a response, from the system or from another user, creates friction that breaks the natural rhythm of conversation and erodes trust in the platform.

LiveKit provides an open-source, WebRTC-based foundation for building these real-time applications. It is high-performance and highly scalable out of the box. But sub-second responsiveness requires developers to tune many parts of the application stack: network conditions, server geography, codec settings, and media pipeline configurations all combine to determine how fast your audio and video actually move.

In this article, we review where delay comes from in real-time streaming and how LiveKit latency optimization works. You will learn how to detect bottlenecks, tune your media pipelines, and apply advanced techniques to keep your app running at peak performance across different network scenarios.

Read on!

What “Latency” Means in LiveKit

Latency in a LiveKit app is the time it takes for a media packet (e.g., a video frame or an audio segment) to travel from the source device to the destination device, where it is decoded and presented to the end user. It's often referred to as glass-to-glass latency in video applications or mouth-to-ear latency in audio.

In LiveKit terms, though, this is not a single measurement but the sum of several processing stages. The media must be captured, encoded, and packetized by the sending device. Then, the packets need to be transmitted over the internet to the LiveKit server, which processes and routes them to the appropriate receiving clients. Finally, the receiving device has to buffer, decode, and render the media.

You need to understand how this pipeline works because each optimization targets a specific stage. A finely tuned video codec will not address server placement delays, just as a hyper-local server cannot fix the processing delay introduced by an over-complicated AI pipeline.

Key Latency Contributors: Network RTT, Jitter, and Packet Loss

Network behavior is the most variable factor in real-time communication. Even with perfectly optimized code, the physical realities of the internet dictate how fast data can travel.

Below are three primary network metrics that dictate the speed and reliability of your LiveKit application.

Round Trip Time (RTT)

Round Trip Time measures the time it takes for a packet of data to travel from the user's device to the server and back again. High RTT directly increases the baseline delay of your application. RTT is heavily influenced by the physical distance between the client and the LiveKit server, as well as the efficiency of the routing paths taken by Internet Service Providers (ISPs).

Jitter

Jitter is the variability in packet arrival time. Since internet traffic is routed dynamically, packets sent at regular intervals may arrive sporadically at the destination. WebRTC clients use a jitter buffer to hold early-arriving packets until late ones catch up, so the stream can be played out smoothly without audio dropouts or frozen video frames. However, a larger jitter buffer also means more latency: high network jitter forces the client to increase the buffer size, artificially delaying media playback.

Packet Loss

Packet loss is the failure of one or more data packets to reach their destination, typically caused by network congestion or hardware malfunctions. In a real-time WebRTC flow over UDP, lost-packet recovery is mandatory, either through Forward Error Correction (FEC) or through retransmission triggered by NACKs (negative acknowledgments). Both recovery mechanisms take time: high packet loss forces the application to spend critical milliseconds reconstructing the media stream, heavily degrading responsiveness.
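You can read all three of these metrics directly from WebRTC's statistics API. Below is a minimal sketch in TypeScript, assuming you can reach the underlying RTCPeerConnection of your LiveKit session; how you obtain it depends on the SDK and version you use.

```typescript
// A minimal sketch, assuming access to the underlying RTCPeerConnection;
// how you get it from your LiveKit client depends on the SDK version.
async function logNetworkHealth(pc: RTCPeerConnection): Promise<void> {
  const stats = await pc.getStats();
  stats.forEach((report) => {
    // The nominated candidate pair carries the measured round-trip time.
    if (report.type === 'candidate-pair' && report.nominated) {
      console.log(`RTT: ${((report.currentRoundTripTime ?? 0) * 1000).toFixed(0)} ms`);
    }
    // Inbound RTP reports expose jitter (in seconds) and cumulative loss.
    if (report.type === 'inbound-rtp' && report.kind === 'audio') {
      console.log(`Jitter: ${((report.jitter ?? 0) * 1000).toFixed(1)} ms`);
      console.log(`Packets lost: ${report.packetsLost ?? 0}`);
    }
  });
}
```

Polling these numbers every few seconds gives you a live picture of which part of the network path is hurting you.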

Choosing the Right LiveKit Region and Geo-Distribution

Since physical distance determines network RTT, the location of your LiveKit servers is one of the most important architectural decisions you will make.

If your users are highly localized, running a single LiveKit instance in a data center near that region can deliver great performance. However, in a global application, a single region will just route distant users over high-latency network routes across continents.

When optimizing LiveKit audio streaming, developers use geo-distributed deployments. LiveKit Cloud users are automatically connected to the edge server closest to their location. If you are self-hosting, you can achieve the same by running a distributed mesh of LiveKit nodes across multiple regions. Keeping the "first mile" of the user's connection (the hop from their local network to the server) as short as possible reduces packet loss, jitter, and RTT.
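If you are self-hosting, a simple client-side probe can pick the nearest region before connecting. The sketch below is illustrative only; the region URLs are hypothetical placeholders for your own deployment's endpoints.

```typescript
// Hypothetical region endpoints; substitute your own deployment's URLs.
const REGIONS: Record<string, string> = {
  'us-east': 'https://livekit-us-east.example.com',
  'eu-west': 'https://livekit-eu-west.example.com',
  'ap-south': 'https://livekit-ap-south.example.com',
};

// Time a lightweight request to each region and connect to the fastest.
async function pickClosestRegion(): Promise<string> {
  const timings = await Promise.all(
    Object.entries(REGIONS).map(async ([region, url]) => {
      const start = performance.now();
      await fetch(url, { method: 'HEAD', cache: 'no-store' });
      return { region, ms: performance.now() - start };
    }),
  );
  timings.sort((a, b) => a.ms - b.ms);
  return timings[0].region;
}
```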

ICE, TURN, and NAT Traversal Optimization

Before media can flow between a user and a LiveKit server, WebRTC must establish a connection through a process called Interactive Connectivity Establishment (ICE). This process involves negotiating firewalls and Network Address Translators (NATs).

The Impact of TURN Servers

When enterprise firewalls are too strict to allow direct UDP connections, WebRTC falls back to Traversal Using Relays around NAT (TURN) servers to relay media traffic over TCP or TLS. Relaying through TURN servers adds a network hop, which increases RTT. In addition, TCP connections are subject to "head-of-line blocking", in which a single lost packet halts the entire stream until it is retransmitted.

Optimizing the Traversal Process

To reduce latency in LiveKit, ensure your infrastructure allows direct UDP connections over standard WebRTC ports whenever possible. When TURN is strictly necessary, deploy TURN servers geographically close to your end-users. LiveKit provides built-in TURN capabilities that can be distributed at the edge, ensuring that even relayed connections take the most efficient path possible.
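It is worth verifying in production whether a given session is actually being relayed. The sketch below checks the selected ICE candidate pair via the standard WebRTC statistics API; it again assumes access to the underlying RTCPeerConnection.

```typescript
// A minimal sketch: returns true when media flows through a TURN relay.
async function isRelayed(pc: RTCPeerConnection): Promise<boolean> {
  const stats = await pc.getStats();
  let localCandidate: any;
  stats.forEach((report) => {
    // The transport report points at the ICE candidate pair in use.
    if (report.type === 'transport' && report.selectedCandidatePairId) {
      const pair = stats.get(report.selectedCandidatePairId);
      if (pair) localCandidate = stats.get(pair.localCandidateId);
    }
  });
  // 'relay' means media traverses a TURN server (an extra hop on the path).
  return localCandidate?.candidateType === 'relay';
}
```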


Audio-First vs Audio+Video Pipelines

The amount of data you send directly affects processing time and network congestion. Video requires much more bandwidth and CPU than audio.

When optimizing LiveKit audio streaming in an app whose core value is voice communication (e.g., an AI voice agent or a customer support dialer), go for an "audio-first" pipeline. Disable video tracks entirely, or prioritize audio packets on your network, so that congestion never interrupts the voice conversation.

For audio-and-video applications, configure LiveKit's bandwidth management to adaptively degrade video resolution before compromising audio quality. Users tolerate blurry video surprisingly well, but even a brief audio dropout makes speech hard to follow and kills the conversational flow.
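With the livekit-client JS SDK, an audio-first setup can look like the sketch below. The adaptiveStream and dynacast options follow the current livekit-client API, but verify them against your SDK version.

```typescript
import { Room } from 'livekit-client';

// Minimal audio-first setup, assuming the livekit-client JS SDK.
async function joinAudioFirst(url: string, token: string): Promise<Room> {
  const room = new Room({
    adaptiveStream: true, // subscribe to video at resolutions matched to the UI
    dynacast: true,       // pause simulcast layers nobody is watching
  });

  await room.connect(url, token);
  await room.localParticipant.setMicrophoneEnabled(true);
  // Audio-first: never publish a camera track, so all bandwidth goes to voice.
  await room.localParticipant.setCameraEnabled(false);
  return room;
}
```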

Codec Choices and Bitrate Configuration

Encoding media translates raw audio and video into compressed packets suitable for transmission. The choice of codec and the target bitrate heavily influence both processing time and network utilization.

Optimizing Audio with Opus

WebRTC relies on the Opus codec for audio. Opus is highly versatile, supporting everything from low-bitrate speech to high-fidelity stereo music. For voice-centric use cases, set Opus to a lower bitrate (e.g., 24-32 kbps) and enable speech profiles. This reduces the payload size of packets, making them less susceptible to network congestion while maintaining high vocal clarity.
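A minimal sketch of this configuration, assuming livekit-client's built-in audio presets and publish defaults (verify the names against your SDK version):

```typescript
import { Room, AudioPresets } from 'livekit-client';

// Speech-tuned publishing defaults with livekit-client.
const room = new Room({
  publishDefaults: {
    audioPreset: AudioPresets.speech, // low-bitrate, speech-optimized Opus
    dtx: true, // discontinuous transmission: send nothing during silence
    red: true, // redundant audio encoding to mask occasional packet loss
  },
});
```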

Selecting the Right Video Codec

For video, developers typically choose among a handful of codecs: VP8, H.264, and the newer VP9 and AV1.

  • VP8 and H.264 are well supported by hardware acceleration, significantly reducing CPU utilization for encoding and decoding on the client side.
  • VP9 and AV1 provide better compression (lower bandwidth at higher CPU usage). If the client device cannot encode AV1 in real time, the processing delay defeats the purpose of using a codec with lower bandwidth requirements (see the configuration sketch below).
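A minimal configuration sketch, assuming your livekit-client version supports the videoCodec and videoEncoding publish defaults:

```typescript
import { Room } from 'livekit-client';

// Codec selection via publish defaults in livekit-client. 'vp8' is the
// safe, widely hardware-accelerated choice; switch to 'vp9' or 'av1' only
// after confirming your clients can encode them in real time.
const room = new Room({
  publishDefaults: {
    videoCodec: 'vp8',
    videoEncoding: {
      maxBitrate: 1_500_000, // bps; keep modest to limit congestion
      maxFramerate: 30,
    },
  },
});
```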

Measuring Perceived vs Technical Latency

Technical latency—the precise millisecond count of network transit and processing time—is only one aspect of the equation. Perceived latency is how the user actually experiences the delay.

This contrast is critical, especially for real-time AI voice agents. A voice agent pipeline consists of a LiveKit WebRTC connection, Speech-to-Text (STT) transcription, a Large Language Model (LLM) for processing, and Text-to-Speech (TTS) synthesis.

Traditional LLMs, however, can take almost a full second to produce a response, according to WebRTC infrastructure experts. To reduce the perceived delay, developers can use a parallel execution strategy: a fast-response Small Language Model (SLM) runs alongside the more capable LLM and generates an immediate "filler" reply in a fraction of the time. The SLM starts streaming tokens to the TTS engine almost immediately, sharply reducing Time-to-First-Byte (TTFB). The user experiences the agent as highly responsive, even though the technical latency of the LLM's full response remains unchanged.
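The sketch below illustrates the parallel-execution idea. All three functions are hypothetical stand-ins for your model clients and TTS engine, not real library APIs:

```typescript
// Hypothetical stand-ins for your model clients and TTS engine:
declare function fastSlmReply(text: string): Promise<string>; // small, fast model
declare function fullLlmReply(text: string): Promise<string>; // large, slower model
declare function speak(text: string): Promise<void>;          // sends TTS audio into the room

// Run both models in parallel; speak the SLM's filler first to cut
// perceived latency, then deliver the LLM's substantive answer.
async function respond(userUtterance: string): Promise<void> {
  const slmPromise = fastSlmReply(userUtterance); // e.g., "Sure, let me check that."
  const llmPromise = fullLlmReply(userUtterance); // the full answer

  await speak(await slmPromise); // near-instant filler reduces perceived TTFB
  await speak(await llmPromise); // full response arrives moments later
}
```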

Taking Control of Your LiveKit Infrastructure

Optimizing LiveKit audio streaming is not a set-and-forget task; it is a cycle of observing, tuning, and scaling. When you monitor your network metrics in real time (for example, in Grafana or LiveKit telemetry dashboards), you can identify bottleneck nodes.
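LiveKit's client SDKs emit per-participant connection quality events that you can forward to such a dashboard. A minimal sketch, with reportMetric as a hypothetical stand-in for your metrics client:

```typescript
import { ConnectionQuality, Participant, Room, RoomEvent } from 'livekit-client';

// Hypothetical stand-in for your telemetry client:
declare function reportMetric(name: string, labels: Record<string, string>): void;

// Forward LiveKit's per-participant connection quality changes to a
// metrics backend so dashboards can surface degraded nodes.
function watchConnectionQuality(room: Room): void {
  room.on(
    RoomEvent.ConnectionQualityChanged,
    (quality: ConnectionQuality, participant: Participant) => {
      reportMetric('livekit_connection_quality', {
        participant: participant.identity,
        quality: String(quality),
      });
    },
  );
}
```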

Start by making sure your servers are geographically close to your users. Then audit your codec settings, implement smart bandwidth management, and optimize any AI pipelines you have. When you control every aspect of the WebRTC transmission process, you can create robust, real-time applications with an excellent end-user experience.

If you need assistance with optimizing LiveKit audio streaming, we are here to help. Clover Dynamics is a provider of LiveKit integration services, including custom video and audio app development, LiveKit architecture design, web and mobile integration, end-to-end encryption, custom UI for real-time interactions, and more. See our full list of services and contact us to start building your solution today!
