YouTube system design

Dilip Kumar
Jul 20, 2024


YouTube allows content creators to upload videos and viewers to play them. The goal is to support 2+ billion monthly active users.

High level simple design

What problems do we have with this simple design?

Monthly active users: 2+ billion; daily active users: ~800 million
Videos watched per user per day: 5
Total videos watched daily: 800 million × 5 ≈ 4 billion
Concurrent viewers (assume ~10% of daily views overlap at peak): ~400 million
Video download speed per user: 3 MB/s
Total streaming bandwidth needed: 3 MB/s × 400 million ≈ 1,200 TB/s
Delivery limit per single server: 1–2 GB/s (SSD), ~300 MB/s (spinning disk)

As we can see, the aggregate download bandwidth needed is ~1,200 TB/s, but a single server can deliver video at only 1–2 GB/s.
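A quick sanity check of the napkin math in Python; the constants are exactly the assumptions listed above:

```python
# Back-of-envelope check of the capacity numbers above.
DAU = 800e6                # daily active users
VIDEOS_PER_USER = 5        # videos watched per user per day
CONCURRENCY_RATIO = 0.10   # fraction of daily views overlapping at peak
PER_USER_MB_S = 3          # MB/s needed per stream
SERVER_GB_S = 1            # GB/s a single server can deliver

daily_views = DAU * VIDEOS_PER_USER                  # 4 billion
concurrent = daily_views * CONCURRENCY_RATIO         # 400 million
total_mb_s = concurrent * PER_USER_MB_S              # 1.2e9 MB/s
total_tb_s = total_mb_s / 1e6                        # ~1,200 TB/s
servers_needed = total_mb_s / (SERVER_GB_S * 1e3)    # ~1.2 million servers

print(f"{total_tb_s:,.0f} TB/s total; ~{servers_needed:,.0f} servers at {SERVER_GB_S} GB/s each")
```

In other words, no single machine (or small cluster) can serve this; the design has to spread streams across a very large fleet.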

Optimize video stream service with horizontal scaling

We can horizontally scale the streaming service by sharding on videoId, as below.

Given the ~1 GB/s per-server streaming limit, we can store only a limited set of videos on each server. Spreading videos across shards fixes the bandwidth limitation we discussed with the previous approach.
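A minimal sketch of videoId-based routing; the hash scheme and shard count here are illustrative, and a production system would likely use consistent hashing so that adding shards does not remap every video:

```python
import hashlib

NUM_SHARDS = 1024  # illustrative shard count

def shard_for_video(video_id: str) -> int:
    """Map a videoId deterministically to the shard that stores/streams it."""
    digest = hashlib.md5(video_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

print(shard_for_video("dQw4w9WgXcQ"))  # the same video always routes to the same shard
```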

Though we have solved that limitation, this design still does not scale well for hot videos.

Just as each server has a streaming-rate limit, it also has a limit on concurrent users. For a hot video, only a fixed number of users can be served by the single shard hosting it.

Optimize video stream service with vertical sharding

We can split a single video into smaller chunks of, say, 120 MB. Each chunk is stored separately on a different shard, and the chunk details for each video are stored in a Chunks table.

Vertical sharding splits traffic for the same video across the various shards. The client passes both the videoId and an offset to download the chunk needed to continue playback.
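A hedged sketch of the offset-to-chunk lookup; the 120 MB chunk size comes from the text, while the table layout and shard numbers are assumptions:

```python
CHUNK_SIZE = 120 * 1024 * 1024  # 120 MB chunks, as described above

# Chunks table: videoId -> ordered list of (chunkId, shard) rows.
CHUNKS_TABLE = {
    "video-42": [("video-42/0", 7), ("video-42/1", 13), ("video-42/2", 2)],
}

def chunk_for_offset(video_id: str, byte_offset: int):
    """Resolve a playback byte offset to the chunk (and shard) serving it."""
    index = byte_offset // CHUNK_SIZE
    chunk_id, shard = CHUNKS_TABLE[video_id][index]
    return chunk_id, shard

print(chunk_for_offset("video-42", 150 * 1024 * 1024))  # -> ('video-42/1', 13)
```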

This optimizes video streaming. But if a video is too hot, the single server hosting a given chunk will still not be able to handle the load.

One alternative is a persistent-connection approach, where the client holds a long-lived connection to the chunk server and the server pushes data over it. Its drawback is that every open connection pins server memory and file descriptors, so a hot server runs out of connection capacity before it runs out of bandwidth.

A CDN helps to solve this hot-chunk issue, which is why we don't serve video directly from the origin server (covered below).

Optimize video stream service with vertical sharding followed by horizontal scaling

To handle the throughput, we can add replicas for each chunk server. This helps distribute the load of streaming a chunk across n servers.
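One simple replica-selection policy, sketched below; random choice is assumed here, though round-robin or least-loaded selection are equally common:

```python
import random

# Assumed replica map: each chunk is hosted by n replica servers.
REPLICAS = {
    "video-42/1": ["chunk-srv-13a", "chunk-srv-13b", "chunk-srv-13c"],
}

def pick_replica(chunk_id: str) -> str:
    """Spread reads for a hot chunk across its replicas."""
    return random.choice(REPLICAS[chunk_id])

print(pick_replica("video-42/1"))  # any one of the three replicas
```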

This improves video streaming further, but it still suffers from latency if a popular video is being watched concurrently across geographic locations.

Use of CDN/Edge Server vs Origin Server

Instead of serving from the origin server, we can use Akamai (or a different CDN provider) to publish the video chunks across the world via CDN/edge servers.

The CDN connects the user to the nearest available edge server and streams chunks from there. The CDN manages a caching mechanism to keep hot chunks at the edge; on a cache miss, it reads from the origin server and updates the cache.

CDNs typically use an LRU cache strategy: hot videos stay in the cache, and the least recently used ones are evicted.
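A compact LRU cache in the spirit of what an edge server might run, built on Python's OrderedDict; real CDN eviction is far more sophisticated (size- and popularity-aware), so treat this as a sketch:

```python
from collections import OrderedDict

class EdgeCache:
    """Keep hot chunks in memory; evict the least recently used on overflow."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.cache = OrderedDict()  # chunk_id -> chunk bytes

    def get(self, chunk_id, fetch_from_origin):
        if chunk_id in self.cache:
            self.cache.move_to_end(chunk_id)   # mark as most recently used
            return self.cache[chunk_id]
        data = fetch_from_origin(chunk_id)     # cache miss: read from origin
        self.cache[chunk_id] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)     # evict the LRU entry
        return data

cache = EdgeCache(capacity=2)
cache.get("chunk-a", lambda cid: b"...")  # miss -> origin fetch, then cached
cache.get("chunk-a", lambda cid: b"...")  # hit -> served from the edge
```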

Smart Client Player to utilize chunk design

The client video player can be designed to leverage this chunked server design and the stream API, as below (a sketch follows the list).

  1. The video player uses a stream API.
  2. The stream API takes both a videoId and a chunkId.
  3. The client makes an API call to fetch the initial chunk.
  4. It then keeps requesting the next batch of chunks and buffers them in memory.
  5. If the buffer is full, it evicts chunks that have already been played.
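A sketch of that client loop; the fetch function, buffer size, and chunk numbering are assumptions for illustration:

```python
from collections import deque

MAX_BUFFERED = 3  # assumed in-memory buffer, measured in chunks

def render(chunk: bytes):
    pass  # hand decoded frames to the display; stubbed out here

def play(video_id: str, total_chunks: int, fetch_chunk):
    """Prefetch chunks ahead of playback; drop played chunks as we go."""
    buffer = deque()
    next_chunk = 0
    while next_chunk < total_chunks or buffer:
        # Keep the buffer topped up until the video is exhausted.
        while len(buffer) < MAX_BUFFERED and next_chunk < total_chunks:
            buffer.append(fetch_chunk(video_id, next_chunk))
            next_chunk += 1
        render(buffer.popleft())  # play the oldest buffered chunk

play("video-42", total_chunks=10, fetch_chunk=lambda v, i: f"{v}/{i}".encode())
```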

Adaptive video player based on Network Speed

The same device on different networks receives the video stream at different rates; for example, the same video downloads faster on a 5G network and slower on a 2G network. This causes slowness or buffering in the video player and interrupts the viewing experience.

An adaptive video player is designed to keep track of the requested bitrate and the throughput actually received from the streaming server.

Server throughput estimation is typically based on application-layer signals that are updated every few seconds (weighted-average throughput probes), and this has proven sufficient for Video on Demand (VoD) streaming.
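A sketch of such a weighted-average throughput probe; the smoothing factor is an assumed value:

```python
class ThroughputEstimator:
    """Exponentially weighted moving average over per-segment downloads."""

    def __init__(self, alpha: float = 0.3):  # assumed smoothing factor
        self.alpha = alpha
        self.estimate_mbps = None

    def on_segment_downloaded(self, size_bits: float, seconds: float) -> float:
        sample = size_bits / seconds / 1e6  # measured Mbps for this segment
        if self.estimate_mbps is None:
            self.estimate_mbps = sample
        else:
            self.estimate_mbps = (self.alpha * sample
                                  + (1 - self.alpha) * self.estimate_mbps)
        return self.estimate_mbps

est = ThroughputEstimator()
print(est.on_segment_downloaded(size_bits=8e6, seconds=2.0))  # -> 4.0 Mbps
```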

However, the same approach doesn't scale well for live, low-latency streaming.

Let's go through a few video streaming standards that enable low-latency streaming services.

DASH — Dynamic Adaptive Streaming over HTTP (Latency 10–45 sec)

DASH (developed by MPEG) is an industry-wide adopted video streaming standard that is compatible with popular video codecs such as H.264 and H.265 (HEVC) and with virtually all end-user devices.

An origin server transcodes the source video content into multiple representations (resolutions/bitrates) and then segments each representation into smaller files, called segments, typically 2–10 s in duration.

How the content is organized into representations and segments is described in a manifest file called an MPD (Media Presentation Description), which is advertised to the video client at the start of the streaming session, along with the latency targets for the particular video stream.

The client then requests every segment sequentially, according to its adaptive bitrate (ABR) module, i.e., an optimization function that decides the representation for every segment.
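A toy version of such an ABR decision function; the bitrate ladder and the safety margin are illustrative, not from any particular player:

```python
# Assumed bitrate ladder (Mbps) advertised by the manifest, low to high.
LADDER_MBPS = [0.5, 1.0, 2.5, 5.0, 8.0]
SAFETY = 0.8  # request below the estimated throughput to absorb jitter

def choose_representation(estimated_mbps: float) -> float:
    """Pick the highest representation that fits the throughput estimate."""
    affordable = [b for b in LADDER_MBPS if b <= estimated_mbps * SAFETY]
    return affordable[-1] if affordable else LADDER_MBPS[0]

print(choose_representation(3.2))  # 3.2 * 0.8 = 2.56 -> picks the 2.5 Mbps rung
```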

DASH has proven ideal for both Video on Demand (VoD) and live streaming, offering high-quality streaming with latency in the range of 10–45 s.

HLS (HTTP Live Streaming)

HLS was developed by Apple and is designed to deliver live and on-demand video content over HTTP. It breaks down video into small segments (TS files) that can be streamed over HTTP, making it ideal for adaptive bitrate streaming, where the quality of the video is adjusted based on the viewer’s network conditions.

Chunked transfer encoding (CTE) with MPEG Common Media Application Format (CMAF) (Latency < 5 sec)

Until recently, any content distributor wanting to reach users on both Apple and Microsoft devices had to encode and store the same data twice. This made reaching viewers across iPhones, smart TVs, Xboxes, and PCs both costly and inefficient.

Apple and Microsoft recognized this inefficiency, so the two organizations went to the Moving Picture Experts Group (MPEG) with a proposal: by establishing a new standard called the Common Media Application Format (CMAF), they would work together to reduce the complexity of delivering video online.

CMAF replaces these duplicate sets of files with a single set of audio/video MP4 files plus the per-protocol adaptive bitrate manifests that reference them.

CMAF in itself is a media format. But by incorporating it into a larger system aimed at reducing latency, leading organizations are moving the industry forward.

This requires two key behaviors:

  1. Chunked Encoding
  2. Chunked Transfer Encoding

Let’s remap some of the terminology before diving any deeper.

  • A chunk is the smallest unit.
  • A fragment is made up of one or more chunks.
  • A segment is made up of one or more fragments.

With segment sizes of 2–10 s, and with DASH clients requiring multiple segments to be downloaded before playback even starts, the accumulated delay makes low latency infeasible.

In contrast, with CTE a chunk can be as small as a single frame, so it can be delivered to the client in near real time, even before the segment is fully encoded and available on the hosting server.
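A minimal illustration of the idea behind CTE: the server pushes each chunk downstream the moment it is encoded, instead of waiting for the whole segment. The encoder stub and chunk count are assumptions:

```python
import time

def encode_segment_in_chunks(segment_id: int, chunks_per_segment: int = 8):
    """Stub encoder: yields one CMAF-style chunk at a time as it becomes ready."""
    for i in range(chunks_per_segment):
        time.sleep(0.05)  # stand-in for real encoding work
        yield f"seg{segment_id}-chunk{i}".encode()

def stream_with_cte(segment_id: int, send):
    """Forward each chunk immediately; the client can start decoding before
    the segment is fully encoded, which is what cuts the latency."""
    for chunk in encode_segment_in_chunks(segment_id):
        send(chunk)

stream_with_cte(0, send=lambda chunk: print(chunk.decode()))
```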

The following shows the difference between DASH and CMAF.

Device-based video rendering

Devices produced by Microsoft, Google, Apple, etc., support their own sets of video formats.

Microsoft

  • Windows: MP4, MOV, AVI, WMV, MKV, FLV, and more
  • Xbox: MP4, MKV, AVI, WMV

Android

  • MP4: The most widely supported format, often used for videos from smartphones and cameras.
  • MOV: Another popular format, often used for videos from Apple devices.
  • AVI: A common format for older videos.
  • MKV: A versatile format that can support multiple audio and video tracks.
  • FLV: A format commonly used for online videos.

Apple

  • MP4: The primary format used for videos on Apple devices.
  • MOV: Another popular format, often used for videos from Apple devices.
  • HEVC: A high-efficiency video coding format that can offer better compression and quality.

Video Codec

A video codec (coder-decoder) is a method used to compress and decompress video data. It determines how the video is encoded, which affects its quality, file size, and compatibility with different devices and software.

Examples of common video codecs: H.264 (AVC), H.265 (HEVC), VP9, AV1, MPEG-4

Handle Audio and Video data

Video streaming platforms typically use a combination of techniques to handle audio and video data efficiently and deliver a smooth viewing experience. Here’s a breakdown of the key processes involved:

Encoding

  • Audio and Video Compression: The original audio and video data is compressed into a smaller, more manageable format. This is done using codecs like H.264, H.265, or VP9 for video and AAC, MP3, or Opus for audio.
  • Adaptive Bitrate Streaming (ABR): The video is often encoded at multiple bitrates to cater to viewers with different network speeds. ABR allows the platform to dynamically adjust the bitrate being delivered based on the viewer’s network conditions (a sketch of producing such a ladder follows this list).
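As one concrete, hedged example of producing such a ladder, the sketch below shells out to ffmpeg; the rungs, bitrates, and flags mirror common practice rather than any specific platform's pipeline:

```python
import subprocess

# Assumed ABR ladder: (height, video bitrate, audio bitrate).
LADDER = [(1080, "5000k", "192k"), (720, "2500k", "128k"), (360, "800k", "96k")]

def transcode_ladder(src: str):
    """Encode one H.264/AAC rendition of the source per ladder rung."""
    for height, v_bitrate, a_bitrate in LADDER:
        subprocess.run(
            ["ffmpeg", "-i", src,
             "-c:v", "libx264", "-b:v", v_bitrate,
             "-vf", f"scale=-2:{height}",   # keep aspect ratio, set height
             "-c:a", "aac", "-b:a", a_bitrate,
             f"out_{height}p.mp4"],
            check=True,
        )

transcode_ladder("source.mp4")
```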

Packaging

  • Segmenting: The compressed audio and video data is segmented into smaller chunks (usually a few seconds long). This makes it easier to stream over the network and allows for adaptive bitrate streaming.
  • Container Formats: These segments are packaged into container formats like MP4 or TS. The container format defines how the audio and video data are stored and synchronized within the file.

Streaming Protocol

HTTP-based Protocols: Most video streaming platforms use HTTP-based protocols like HTTP Live Streaming (HLS) or Dynamic Adaptive Streaming over HTTP (DASH). These protocols allow efficient delivery of video segments over the internet. CTE with CMAF is a newer format that is still in the adoption phase.

Delivery

  • Content Delivery Network (CDN): To reduce latency and ensure smooth playback, video streaming platforms use CDNs. CDNs distribute the video content across multiple servers located geographically closer to the viewers.
  • Player: The viewer’s device (e.g., computer, smartphone, tablet) uses a video player to decode and render the streamed audio and video data.

Synchronization

Timestamps: The audio and video segments are synchronized using timestamps. This ensures that the audio and video play in sync, preventing lip-sync issues.

Adaptive Bitrate Adjustments

Real-time Monitoring: The streaming platform continuously monitors the viewer’s network conditions. If the network speed drops, it can switch the viewer to a lower bitrate stream to prevent buffering. If the network improves, it can upgrade the viewer to a higher bitrate stream for better quality.

How are ads inserted into live streams vs. video on demand?

The process of inserting ads into live streams and video on demand (VOD) differs due to the inherent differences between these two types of content.

Live Streams

  • Dynamic Ad Insertion: Ads are typically inserted dynamically during the live stream, allowing for real-time targeting and customization.
  • Ad Breaks: Ad breaks can be scheduled in advance or triggered based on specific events within the live stream (e.g., after a commercial break, before a halftime show).

Challenges:

  • Latency: Ads need to be inserted quickly to avoid disrupting the viewer experience.
  • Synchronization: The ad must be synchronized with the live content to prevent interruptions.
  • Targeting: Targeting ads in real time is more complex than in VOD, as the audience fluctuates (a toy stitching sketch follows this list).
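A toy sketch of the stitching step in dynamic ad insertion: at a cue point, ad segments are spliced into the segment list the player sees. Segment names and the cue mechanism are assumptions:

```python
def splice_ads(content_segments: list, cue_index: int, ad_segments: list) -> list:
    """Return the playlist with an ad break inserted at the cue point."""
    return (content_segments[:cue_index]
            + ad_segments
            + content_segments[cue_index:])

playlist = splice_ads(["c0.ts", "c1.ts", "c2.ts"], cue_index=2,
                      ad_segments=["ad0.ts", "ad1.ts"])
print(playlist)  # ['c0.ts', 'c1.ts', 'ad0.ts', 'ad1.ts', 'c2.ts']
```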

Video on Demand

  • Pre-roll, Mid-roll, and Post-roll Ads: Ads can be inserted before, during, or after the video content.
  • Static Ad Insertion: Ads are typically inserted statically during the video encoding process, making it easier to manage and schedule.

Challenges:

  • User Experience: Too many ads can negatively impact the user experience.
  • Ad Skipping: Users may skip ads if they find them intrusive.

Upload Flow System design

The following is a high-level design of the video upload flow, which produces files at multiple encoding levels to support ABR.
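Since the figure is high level, here is a hedged sketch of the upload pipeline's stages; every function below is a stub standing in for a real service:

```python
# Stubs standing in for real services in this sketch.
def store_original(video_id, data): pass                    # durable blob store
def transcode_all_renditions(data): return [data]           # stub: one rendition
def split_into_chunks(rendition): return [rendition[:10]]   # stub: one chunk
def record_chunk_metadata(video_id, chunk): pass            # Chunks table row
def publish_to_cdn(chunk): pass                             # warm the edge caches
def mark_ready_for_playback(video_id): pass

def handle_upload(raw_video: bytes, video_id: str):
    """Upload flow: store the original, transcode to the ABR ladder,
    chunk each rendition, publish to the CDN, then mark playable."""
    store_original(video_id, raw_video)
    for rendition in transcode_all_renditions(raw_video):
        for chunk in split_into_chunks(rendition):
            record_chunk_metadata(video_id, chunk)
            publish_to_cdn(chunk)
    mark_ready_for_playback(video_id)

handle_upload(b"...raw bytes...", "video-42")
```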

Happy learning :-)
