Low-Latency Video Streaming over 5G

Dilip Kumar
Aug 9, 2024


Overview

Live video streaming traffic is predicted to account for 29.7% of all Internet traffic by the end of 2023 due to the growing popularity of live video streaming services including live event streaming, shoppable social live streaming, and esports and gaming.

To enable real-time interaction, these applications require minimal end-to-end latency (also known as glass-to-glass latency).

Latency is defined as the time delay between video capture and the actual playback at the client, imposing tight content delivery constraints.

Throughput estimation for VoD vs Streaming

Video streaming clients are tasked with adapting the requested video bitrate to the available throughput, with the objective of ensuring high-quality, uninterrupted streaming.

To address the QoE (Quality of Experience) optimization challenge, most video streaming applications typically require accurate throughput estimation, in order to adapt their streaming bitrate to the current network conditions, thus ensuring uninterrupted high-quality streaming.

Throughput estimation is typically based on application-layer signals that are updated every few seconds (weighted-averaged throughput probes) and has proven sufficient for Video on Demand (VoD) streaming.
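As a rough sketch of such application-layer probing, here is a minimal estimator that averages recent per-segment download probes (the class name, window size, and averaging scheme are illustrative assumptions; real players weight samples in various ways):

```python
from collections import deque

class ThroughputEstimator:
    """Sliding-window throughput estimator, a simplified sketch of the
    segment-level probing that VoD players typically rely on."""

    def __init__(self, window=5):
        # Each probe is (bytes downloaded, seconds taken) for one segment.
        self.samples = deque(maxlen=window)

    def add_probe(self, bytes_downloaded, download_seconds):
        self.samples.append((bytes_downloaded, download_seconds))

    def estimate_bps(self):
        # total bytes / total time equals the byte-weighted harmonic mean
        # of per-segment rates, which damps short bursts.
        total_bytes = sum(b for b, _ in self.samples)
        total_time = sum(t for _, t in self.samples)
        return 8 * total_bytes / total_time if total_time else 0.0

est = ThroughputEstimator()
est.add_probe(2_500_000, 2.0)   # 2.5 MB segment in 2 s -> 10 Mbps
est.add_probe(1_250_000, 2.0)   # 1.25 MB segment in 2 s -> 5 Mbps
print(est.estimate_bps())       # blended estimate in bits per second
```

Because each probe spans a whole multi-second segment, the estimate updates only every few seconds, which is exactly the limitation low-latency streaming runs into below.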

However, low-latency streaming carries an additional delivery objective, namely timeliness. It not only requires more frequent updates on network conditions, but is also known to undermine conventional throughput estimation, because with chunked transfer encoding a segment's download time includes idle periods while later chunks are still being produced at the origin.

MPEG-DASH — Dynamic Adaptive Streaming over HTTP (Latency 10–45 sec)

MPEG-DASH (Dynamic Adaptive Streaming over HTTP) is an industry-wide adopted video streaming standard that is codec-agnostic, working with all popular video codecs (e.g., H.264 and H.265/HEVC) and end-user devices.

An origin server transcodes the source video content into multiple representations (resolutions/bitrates) and then segments each representation into smaller files, namely segments, that are typically 2–10 s in duration.

The way the content has been organized in representations and segments is described in a manifest file called an MPD (media presentation description) that is advertised to the video client at the initiation of the streaming session, along with the latency targets set for the particular video stream.
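To make the MPD concrete, here is a toy manifest and the few lines needed to read its representation ladder (the MPD below is a deliberately minimal, hypothetical example; real manifests carry many more elements such as segment templates and timing attributes):

```python
import xml.etree.ElementTree as ET

# A minimal, hypothetical MPD advertising two representations of one video.
MPD = """<MPD xmlns="urn:mpeg:dash:schema:mpd:2011" type="dynamic">
  <Period>
    <AdaptationSet mimeType="video/mp4">
      <Representation id="low" bandwidth="500000" width="640" height="360"/>
      <Representation id="high" bandwidth="2500000" width="1920" height="1080"/>
    </AdaptationSet>
  </Period>
</MPD>"""

ns = {"mpd": "urn:mpeg:dash:schema:mpd:2011"}
root = ET.fromstring(MPD)

# Extract (id, bitrate) pairs: this ladder is what the ABR module chooses from.
ladder = [(r.get("id"), int(r.get("bandwidth")))
          for r in root.findall(".//mpd:Representation", ns)]
print(ladder)  # [('low', 500000), ('high', 2500000)]
```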

The client then proceeds to request every segment sequentially, according to its adaptive bitrate (ABR) module, i.e., an optimization function that decides the representation for every segment.

The segments are downloaded into a temporary queue, known as the buffer, before decoding and playout.

DASH has proven well suited to both Video on Demand (VoD) and live streaming, offering high-quality playback with latency in the range of 10–45 s.
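A minimal rate-based ABR decision can be sketched in a few lines (the function, bitrate ladder, and safety margin are hypothetical; production ABR algorithms also weigh buffer level and estimate uncertainty):

```python
def select_representation(bitrates_bps, est_throughput_bps, safety=0.8):
    """Rate-based ABR sketch: pick the highest bitrate that fits within a
    safety margin of the estimated throughput; fall back to the lowest
    rung when nothing fits."""
    affordable = [b for b in bitrates_bps if b <= safety * est_throughput_bps]
    return max(affordable) if affordable else min(bitrates_bps)

# Example MPD ladder (bits per second) and a 4 Mbps throughput estimate.
ladder = [500_000, 1_200_000, 2_500_000, 5_000_000]
print(select_representation(ladder, 4_000_000))  # picks 2_500_000
```

The safety margin below 1.0 is the usual way to leave headroom for estimation error, at the cost of occasionally streaming below the achievable quality.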

Note: HLS (HTTP Live Streaming) is another popular HTTP-based streaming protocol.

Chunked transfer encoding (CTE) with MPEG Common Media Application Format (CMAF) (Latency < 5 sec)

Until recently, any content distributor wanting to reach users on both Apple and Microsoft devices had to encode and store the same data twice. This made reaching viewers across iPhones, smart TVs, Xboxes, and PCs both costly and inefficient.

Apple and Microsoft recognized this inefficiency. So, the two organizations went to the Moving Picture Experts Group with a proposal. By establishing a new standard called the Common Media Application Format (CMAF), they would work together to reduce complexity when delivering video online.

CMAF replaces these duplicate sets of media files with a single set of fragmented MP4 audio/video files, which can then be referenced by multiple adaptive bitrate manifests (e.g., HLS playlists and DASH MPDs).

CMAF in itself is a media format. But by incorporating it into a larger system aimed at reducing latency, leading organizations are moving the industry forward.

This requires two key behaviors:

  1. Chunked Encoding
  2. Chunked Transfer Encoding

Let’s clarify some of the terminology before diving any deeper.

  • A chunk is the smallest unit.
  • A fragment is made up of one or more chunks.
  • A segment is made up of one or more fragments.

CTE is one of the main features of HTTP/1.1, allowing a segment to be delivered in small pieces called chunks. The fundamental motivation is that, with legacy DASH, the origin server has to wait for an entire segment to be encoded and packaged before advertising that segment to the client. This preparation process adds at least one segment's worth of delay.

With segments 2–10 s long, and DASH clients requiring multiple segments to be downloaded before playback even starts, this added delay makes low latency infeasible.

In contrast, a chunk can be as small as a single frame. Therefore, it can be delivered to the client in near real-time, even before the segment is fully encoded and available on the hosting server side.
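The wire format that makes this possible is simple. Below is a sketch of how an origin could frame CMAF chunks using HTTP/1.1 chunked transfer encoding (the function and the chunk payloads are illustrative; in reality each chunk would be a binary `moof`+`mdat` pair streamed as soon as it is encoded):

```python
def cte_frame(chunks):
    """Frame byte chunks per HTTP/1.1 chunked transfer encoding:
    each chunk is sent as <hex length>\\r\\n<data>\\r\\n, and a
    zero-length chunk terminates the response body."""
    out = b""
    for chunk in chunks:
        out += f"{len(chunk):x}\r\n".encode() + chunk + b"\r\n"
    return out + b"0\r\n\r\n"

# Two CMAF chunks of one segment leaving the origin as they are produced;
# the client can start decoding chunk 1 before chunk 2 even exists.
wire = cte_frame([b"moof+mdat#1", b"moof+mdat#2"])
print(wire)
```

The catch for throughput estimation: the response has no Content-Length, and gaps between chunks reflect encoding pace rather than network speed, which is why segment-level timing misleads the estimator.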

DASH vs CMAF

Buffer size and latency lag

To achieve low-latency streaming, video clients are by design limited to a very short maximum buffer, typically on the order of 4 s.

As video chunks arrive at the client, they are temporarily stored in the buffer queue and consumed in order at a playback rate of 1, i.e., 1 s of video content is played back per 1 s of wall-clock time.

Therefore, in a scenario where the content arrival rate exceeds 1 (requested bitrate lower than the available throughput), the buffer queue steadily grows, and in turn so does the latency lag (content ages as it sits in the buffer).

Nonetheless, shorter buffers naturally offer a smaller cushion against throughput variation and/or wrong bitrate-adaptation decisions.

For instance, upon a sudden drop in the available throughput, a low-latency video client has only a very short reaction window before the buffer runs out of video data, an event that practically constitutes a stall. To recover from a stall, the client must refill its queue up to a minimum buffer level (typically on the order of 1 s) before playback can resume, all the while falling further behind in latency.
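These dynamics are easy to see in a toy simulation (the model, parameter values, and 1 s minimum-buffer rule below are simplifying assumptions, not a faithful player model):

```python
def simulate(buffer_s, arrival_rate, seconds, min_buffer=1.0):
    """Toy buffer model: each wall-clock second, `arrival_rate` seconds
    of video arrive; 1 s of content is played back unless the buffer has
    run dry, in which case playback pauses (a stall) until `min_buffer`
    seconds have been refilled. Returns (final buffer, latency lag)."""
    lag = 0.0
    stalled = False
    for _ in range(seconds):
        buffer_s += arrival_rate
        if stalled and buffer_s >= min_buffer:
            stalled = False          # enough refilled, resume playback
        if not stalled and buffer_s >= 1.0:
            buffer_s -= 1.0          # play 1 s of content
        else:
            stalled = True
            lag += 1.0               # the live edge moves on while we wait
    return buffer_s, lag

# Throughput drop: content arrives at only 0.5x real time for 10 s,
# starting from a 4 s-class client buffer holding 2 s of video.
print(simulate(buffer_s=2.0, arrival_rate=0.5, seconds=10))
```

With a 2 s buffer, the client survives only a few seconds of the drop before stalling, and every stalled second is added straight to the latency lag.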

Ways to alleviate this added latency

  1. Request the most recent content from the origin server after a stall, practically skipping part of the video sequence, with direct implications for the continuity of the streaming experience.
  2. Employ a playback rate higher than 1 (speed up) until the latency lag is minimized, again at some cost to the streaming experience, and with an increased probability of a "back-to-back" stall.
  3. Avoid rebuffering altogether by employing a conservative adaptation policy and a fast reaction time, faster than the time required for a sudden throughput drop to register in the (windowed) throughput-estimation module of typical video players.

Thus, to ensure good QoE for users, it is imperative for the client device to react to dynamic network conditions.

Parameters that impact QoE

Server-side configurable parameters

  1. Segment size
  2. Chunk size
  3. Target latency selection

Client-side parameters

  1. Buffer size

Lower-layer metrics

  1. Signal strength
  2. Network load

References

https://antmedia.io/cmaf-streaming/

https://arxiv.org/pdf/2403.00752
