Rate limiter system design
Every system is designed to handle a target level of traffic. But a system may receive more traffic than expected, either because of genuinely high demand or because of bad actors sending heavy traffic to take the site down.
In all these cases, the system should be designed to handle the unexpected traffic gracefully, without impacting other users and without over-utilizing resources.
One standard approach is to apply a rate limiter to handle the unexpected traffic. The goal here is to design a highly scalable rate limiter system that can handle YouTube.com-scale traffic.
Types of rate limiter
Rate limiters are applied at the following layers and are known by different names.
1. DDoS protection services
DDoS attacks are typically Layer 3 (of the OSI model) attacks that focus on network infrastructure. Attackers try to overwhelm a network with traffic to prevent legitimate users from accessing services.
Rate limiting for DDoS attacks can be implemented with the following approaches.
- Throttle: You can enforce a maximum request limit per client or across all clients by throttling individual clients to a user-configured threshold.
- Rate-based ban: You can rate limit requests that match a rule on a per-client basis and then temporarily ban those clients for a configured period of time if they exceed a user-configured threshold.
How to identify a client?
A client in a DDoS attack can be identified in the following ways:
- ALL
- IP
- HTTP_HEADER
- HTTP_COOKIE
- REGION_CODE
- USER_IP
Throttling traffic
The throttle action in a rule lets you enforce a per-client request threshold to protect backend services.
For example, you might set the request threshold to 2,000 requests within 1,200 seconds (20 minutes). If a client sends 2,500 requests within any 1,200 second period, approximately 20% of the client’s traffic is throttled until the permitted request volume is at or below the configured threshold.
Banning clients based on request rates
The rate_based_ban action in a rule lets you enforce a per-client threshold to temporarily ban clients that exceed the limit, by applying the configured exceed_action to all requests from the client for a configurable time period.
For example, you might set the request threshold to 2,000 requests within 1,200 seconds (20 minutes). If a client sends 2,500 requests within any 1,200-second period, the exceed_action is applied to traffic from that client exceeding the 2,000-request threshold until the full 1,200 seconds has elapsed, and for an additional number of seconds that you set as the ban duration. If the ban duration is set to 3,600, for example, traffic from the client is banned for 3,600 seconds (one hour) beyond the end of the threshold interval.
Google Cloud Armor is one of the good products for handling DDoS attacks.
2. Server level Rate limiter
A Layer 7 rate limiter is applied at the application level. The goal is to control resource usage at the level of each task. It can be keyed on UserId, CustomerId, or applied to all incoming RPCs.
It typically handles excess requests as follows:
- Blocking: Rejecting the request outright.
- Throttling: Delaying or queuing the request.
- Redirecting: Directing the client to a different resource or error page.
3. Quota Server
This is also a Layer 7 rate limiter. The goal is to apply a preconfigured quota to an individual UserId or CustomerId. On exceeding the limit, the request is handled by blocking, throttling, or serving it with reduced features.
4. Fair shared server resource usage
This is also a Layer 7 rate limiter, based on a protocol between client and server. Every client that wants to use the shared resources makes an RPC call to the RateLimiter server to get the quota assigned to it for each requested resource.
After that, the client is expected to generate traffic within the assigned limit. If the client generates traffic beyond the limit, requests will be blocked, throttled, or served with reduced features, as configured.
Before the quota expires, the client makes another RPC call to the RateLimiter to find out the next available quota. The RateLimiter server recalculates the available resources and allocates new quotas as fairly as possible.
Goal
Our goal is to implement the various types of RateLimiter covering these different approaches. We will focus on L7-based rate limiting. An L3-based RateLimiter can leverage the same approach, but we skip L3 here as it needs a separate blog post.
Algorithm to implement rate limiter
Before we start designing the system for the RateLimiter, let's understand the following algorithms used to implement rate limiting.
- Token bucket
- Leaking bucket
- Fixed window counter
- Sliding window log
- Sliding window counter
Token bucket
It works by maintaining a virtual “bucket” that can hold a maximum number of “tokens.” Tokens represent units of data that can be processed, and they are added to the bucket at a constant rate. The following steps describe how it works.
- Initialization: The bucket is initialized with a certain number of tokens, often equal to the maximum it can hold.
- Token replenishment: Tokens are added to the bucket at a steady rate, determined by the desired average rate of data flow.
- Data transmission: When a data unit arrives, the algorithm checks whether there are enough tokens in the bucket to process it. If there are, a token is removed from the bucket and the data unit is processed. Otherwise, the data unit is either rejected or queued for later processing.
A challenge with this algorithm is tuning its two parameters: the bucket size and the token refill rate.
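As a minimal sketch (in Python, with illustrative parameter values), the steps above might look like:

```python
import time

class TokenBucket:
    """Token bucket: tokens are replenished at a fixed rate up to a cap,
    and each processed request consumes one token."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity           # maximum tokens the bucket can hold
        self.refill_rate = refill_rate     # tokens added per second
        self.tokens = capacity             # initialized full
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Replenish tokens for the elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1               # consume a token and admit the request
            return True
        return False                       # no tokens: reject or queue the request
```

The two tuning parameters are exactly the constructor arguments: capacity and refill_rate.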
Leaking bucket algorithm
This is similar to the token bucket, except that requests are processed at a fixed rate. It is usually implemented with a FIFO queue.
The problem here is that a burst of traffic fills up the queue with old requests, and if they are not processed in time, recent requests get rate limited.
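A minimal Python sketch of the FIFO-queue implementation (the queue size is a made-up value; in practice a timer drains the queue at the configured rate):

```python
from collections import deque

class LeakyBucket:
    """Leaky bucket: requests enter a bounded FIFO queue; a separate
    worker drains (processes) them at a fixed rate."""

    def __init__(self, queue_size: int):
        self.queue = deque()
        self.queue_size = queue_size

    def try_enqueue(self, request) -> bool:
        if len(self.queue) >= self.queue_size:
            return False                   # bucket full: reject the request
        self.queue.append(request)
        return True

    def leak(self):
        """Invoked at a fixed rate (e.g. from a timer) to process one request."""
        return self.queue.popleft() if self.queue else None
```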
Fixed window counter algorithm
- The algorithm divides the timeline into fixed-size time windows and assigns a counter to each window.
- Each request increments the counter by one.
- Once the counter reaches the pre-defined threshold, new requests are dropped until a new time window starts.
A major problem with this algorithm is that a burst of traffic at the edges of time windows could cause more requests than allowed quota to go through.
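The three steps above can be sketched in Python (timestamps are passed in explicitly to make the behavior easy to follow):

```python
class FixedWindowCounter:
    """Fixed window counter: one counter per fixed-size time window."""

    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        self.current_window = None
        self.count = 0

    def allow(self, now: float) -> bool:
        window = int(now // self.window_seconds)
        if window != self.current_window:
            self.current_window = window   # a new window starts: reset counter
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False                       # threshold reached: drop the request
```

Note the edge-burst problem described above: with a limit of 2 per minute, two requests just before a window boundary and two just after all pass, allowing 4 requests within a single minute of wall-clock time.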
Sliding window log algorithm
The sliding window log algorithm fixes the problem with the fixed window algorithm. It works as follows:
- The algorithm keeps track of request timestamps. Timestamp data is usually kept in a cache, such as a Redis sorted set.
- When a new request comes in, remove all the outdated timestamps. Outdated timestamps are defined as those older than the start of the current time window.
- Add timestamp of the new request to the log.
- If the log size is the same or lower than the allowed count, a request is accepted. Otherwise it is rejected.
For example, assume the allowed limit is 2 requests per minute:
- The log is empty when a new request arrives at 1:00:01, so the request is allowed.
- A new request arrives at 1:00:30 and its timestamp is inserted into the log. After the insertion, the log size is 2, which is not larger than the allowed count of 2, so the request is allowed.
- A new request arrives at 1:00:50 and its timestamp is inserted into the log. After the insertion, the log size is 3, larger than the allowed limit of 2, so this request is rejected. Note that the timestamp remains in the log.
- A new request arrives at 1:01:40. Requests in the range [1:00:40, 1:01:40) are within the latest time window, but requests sent before 1:00:40 are outdated. The algorithm removes the two outdated timestamps 1:00:01 and 1:00:30. After the removal, the log size becomes 2, so the request is accepted.
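The walkthrough above can be reproduced with a small Python sketch, where a plain sorted list stands in for the Redis sorted set and timestamps are in seconds:

```python
import bisect

class SlidingWindowLog:
    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window_seconds = window_seconds
        self.log = []                      # sorted request timestamps

    def allow(self, now: float) -> bool:
        # Remove timestamps older than the start of the current window.
        cutoff = now - self.window_seconds
        self.log = self.log[bisect.bisect_right(self.log, cutoff):]
        # The new timestamp is logged even if the request ends up rejected,
        # matching the walkthrough above.
        self.log.append(now)
        return len(self.log) <= self.limit
```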
Sliding window counter algorithm
Assume the rate limiter allows a maximum of 7 requests per minute, there were 5 requests in the previous minute, and there have been 3 so far in the current minute. For a new request that arrives at the 30% position of the current minute, the number of requests in the rolling window is calculated using the following formula:
- Requests in the current window + requests in the previous window * overlap percentage of the rolling window and the previous window.
- Using this formula, we get 3 + 5 * 0.7 = 6.5 requests. Rounded down, this is 6.
Since 6 is below the limit of 7 requests per minute, the current request can go through; beyond the limit it would be rejected.
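The formula and the worked numbers above, as a Python sketch:

```python
import math

def rolling_window_count(curr_count: int, prev_count: int,
                         position: float) -> int:
    """position is how far into the current window the request arrives
    (0.3 = 30%), so the previous window overlaps for (1 - position)."""
    return math.floor(curr_count + prev_count * (1 - position))

def allow(curr_count: int, prev_count: int,
          position: float, limit: int) -> bool:
    return rolling_window_count(curr_count, prev_count, position) < limit
```

With the numbers above, rolling_window_count(3, 5, 0.3) is floor(6.5) = 6, below the limit of 7, so the request goes through.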
Server side rate limiter (Server Throttler)
ServerThrottler is part of the server-side stack that intercepts all incoming requests and decides which of them are safe to be allowed and which need to be throttled.
It can either be embedded into the app task binary, or implemented following the sidecar pattern.
Embedding helps control the traffic on the app server so that resource utilization stays under control, and we don't need to scale the rate limiter component separately since it ships with the app server.
A server throttler can use the leaky bucket algorithm to implement rate limiting. The following sections describe how it can be implemented.
QPS Based implementation
- A fixed-size queue is used to represent the bucket.
- If the queue is full, incoming requests are rejected.
- If the queue is not full, the request is added to the queue.
- Requests from the queue are processed at a fixed rate.
Cons
A QPS-based rate limiter doesn't always map directly to resource usage. One bad request can exhaust all the resources and cause a server outage.
Instead of QPS, we can use processing based on query cost prediction.
Query Cost prediction algorithm
- The query cost is an estimate of the resource usage of a request.
- The prediction is based on the actual processing costs of similar requests that finished in the recent past.
- The cost of each incoming request is predicted in CPU-milliseconds, taking CPU, memory, network, disk, etc. into account.
- A fixed-cost queue is used to represent the bucket.
- If the queue is full in terms of total cost, incoming requests are rejected.
- If the queue's cost budget is not full, the request is added to the queue.
- Requests from the queue are processed at a fixed rate.
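A minimal sketch of cost-based admission in Python; the moving-average predictor and the default cost of 1.0 CPU-ms are illustrative assumptions, not part of the original design:

```python
from collections import deque

class CostBasedQueue:
    """Admission by predicted cost (in CPU-milliseconds) rather than
    request count."""

    def __init__(self, max_total_cost: float):
        self.max_total_cost = max_total_cost
        self.queue = deque()                   # (request, predicted_cost) pairs
        self.total_cost = 0.0
        self.recent_costs = deque(maxlen=100)  # observed costs, for prediction

    def predict_cost(self) -> float:
        if not self.recent_costs:
            return 1.0                         # assumed default for unseen requests
        return sum(self.recent_costs) / len(self.recent_costs)

    def try_enqueue(self, request) -> bool:
        cost = self.predict_cost()
        if self.total_cost + cost > self.max_total_cost:
            return False                       # queue "full" in cost terms: reject
        self.queue.append((request, cost))
        self.total_cost += cost
        return True

    def record_actual_cost(self, cost_ms: float):
        """Called when a request finishes, to refine future predictions."""
        self.recent_costs.append(cost_ms)
```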
Request Criticality
Criticality represents how important it is for the server to process the request. The possible categories are:
- CRITICAL
- NOT_CRITICAL
If the queue is full, the server checks the criticality of the incoming request. If it is NOT_CRITICAL, the request is rejected.
If it is CRITICAL, the server first looks for a NOT_CRITICAL request in the queue; if it finds one, it removes it from the queue and admits the incoming request, otherwise it rejects it.
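The eviction rule above, sketched in Python (the queue size and request names are made up):

```python
from collections import deque

CRITICAL, NOT_CRITICAL = "CRITICAL", "NOT_CRITICAL"

class CriticalityAwareQueue:
    def __init__(self, max_size: int):
        self.max_size = max_size
        self.queue = deque()               # (request, criticality) pairs

    def try_enqueue(self, request, criticality) -> bool:
        if len(self.queue) < self.max_size:
            self.queue.append((request, criticality))
            return True
        if criticality == NOT_CRITICAL:
            return False                   # queue full: non-critical is rejected
        # Queue full and request is CRITICAL: evict a NOT_CRITICAL entry if any.
        for i, (_, c) in enumerate(self.queue):
            if c == NOT_CRITICAL:
                del self.queue[i]
                self.queue.append((request, criticality))
                return True
        return False                       # queue holds only CRITICAL: reject
```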
Cons with server side implementation
- There is a cost in sending and then rejecting requests. We have already burnt the network capacity and done some processing on the request.
- Clients that have their queries rejected have no clue when to try again. The industry standard mechanism for dealing with this is retry with exponential backoff. This inevitably means that some capacity is left unused.
- Most server side throttlers require statically distributing available capacity among clients (though some throttlers allow sharing unused capacity between clients who are overshooting their limit). A question with statically pre-partitioning the available capacity is whether every client can use their maximum quota simultaneously.
Clearly, clients that do not want their queries rejected by server-side throttling should rate limit themselves.
Local rate limiting
Use a local (task-bound) rate limiter which limits the amount of traffic a single task can send. N tasks, each rate limited to M(i) requests per second, will generate a maximum of ΣM(i) requests per second.
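For example, with three hypothetical tasks:

```python
# Hypothetical per-task limits M(i), in requests per second.
per_task_limits = [100, 250, 150]

# The fleet-wide maximum is the sum of the per-task limits, regardless of
# how traffic is actually distributed across the tasks.
global_max_qps = sum(per_task_limits)   # ΣM(i) = 500 requests per second
```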
Limitation
This rate may be just right, too conservative, or too aggressive, based on external circumstances.
Another problem with this approach is that if one task is operating at the rate limit (and throttling itself), whereas another task is not yet at its limit, the slack capacity in the second task is not allocated to the first task.
A system for global distributed client side rate limiting
This system will apportion and distribute limited capacity for a resource or service to clients that want to use this capacity. The system is agnostic with respect to what this capacity represents.
In the proposed solution each client that wants to rate limit its usage of a shared resource (or service) requests and gets an individual capacity limit L from the system. The value of L for a client is dependent on:
- The total global capacity: No more capacity will be handed out than is available.
- The individual demand of this client (task): Clients will typically not get a higher limit than they request.
- The total global demand: Clients might get a lower limit than they are asking for, based on global demand. The algorithms that do the sharing try to divide the available capacity “fairly”.
It is the client’s responsibility not to overshoot the limit L.
Capacity and limits will be re-evaluated regularly by the system and new limits will be handed out to the clients.
Architecture
The proposed system for global distributed client-side rate limiting consists of the following components:
- A client-side library which provides features for communicating with the rate limiting servers and for helping applications to uphold the limits received from these servers.
- A globally distributed set of rate limiting servers that communicate with the clients, receive all the requests for capacity, apportion the available capacity, and return capacity limits to the requesting clients.
The servers are organized in a hierarchical fashion: The clients communicate with so-called leaf rate limiting servers. The leaves talk to a set of parent servers which aggregate and apportion capacity at a lower level. The parents in their turn talk to root rate limiting servers which perform the same function one level lower in the tree.
The proposed three level hierarchy would allow for the following levels:
- Leaf servers in each cluster.
- Parent servers for each region.
- Root servers globally.
Each resource has a unique identifier which is used by the client when asking for capacity. This identifier is a human readable string which is otherwise opaque to the system.
Lease
The core concept in the rate limiting server is the capacity lease:
// Encapsulates a lease on capacity with an expiry_time in seconds
// since the epoch.
message Lease {
  required int64 expiry_time = 1;
  required int64 refresh_interval = 2;
  required double capacity = 3;
}
A Lease protocol buffer describes a lease on a resource. It contains the expiry_time of the lease (in seconds since the epoch). The lease is valid until the expiry_time, but clients are requested (strongly) to contact the rate limiting server every “refresh_interval” seconds to update their leases.
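A sketch of how a client might hold and refresh such a lease (Python; the GetCapacity RPC name and the stub interface are assumptions for illustration, not the real Doorman API):

```python
from collections import namedtuple
from typing import Optional

# Mirrors the Lease message above.
Lease = namedtuple("Lease", ["expiry_time", "refresh_interval", "capacity"])

class LeaseClient:
    """Hypothetical client-side lease handling."""

    def __init__(self, stub):
        self.stub = stub                   # assumed RPC stub for the server
        self.lease = None

    def refresh(self, resource_id: str, wanted_capacity: float) -> None:
        # Called every refresh_interval seconds to renew the lease.
        self.lease = self.stub.GetCapacity(resource_id, wanted_capacity)

    def current_capacity(self, now: float) -> Optional[float]:
        # Past expiry_time the lease is no longer valid.
        if self.lease is None or now >= self.lease.expiry_time:
            return None
        return self.lease.capacity
```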
Optimistic, pessimistic and safe mode operation
When a client fails to obtain or renew capacity it can do three things:
- In pessimistic mode it can behave as if a capacity of zero has been granted, and effectively stop using the resource. This is obviously safest, but could halt all resource usage if the rate limiting system is temporarily unavailable. This is the right course of action for an offline batch job or daemon that does not interact with users.
- In optimistic mode it can behave as if it got all of the capacity it requested. This means that it can continue to use the resource, but there is a danger that the resource will be overused. This is the right course of action for on-line components that interact with users and with resources that are protected against overloading by other means.
- In safe mode the client is allowed to use the resource up to the per-client safe capacity limit of the resource. The safe capacity is configured in the resource template and returned as part of every request for capacity. If the safe capacity is not configured in the resource template the server calculates a dynamic safe capacity (the available capacity divided by the number of clients) and returns that as the safe capacity to use.
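The three failure modes can be summarized in one small Python sketch (the mode names come from the list above; the numbers in the test values are made up):

```python
from typing import Optional

def capacity_on_failure(mode: str, requested: float,
                        safe_capacity: Optional[float] = None,
                        available: Optional[float] = None,
                        num_clients: Optional[int] = None) -> float:
    """Capacity a client assumes when it fails to obtain or renew a lease."""
    if mode == "pessimistic":
        return 0.0                         # stop using the resource entirely
    if mode == "optimistic":
        return requested                   # behave as if fully granted
    if mode == "safe":
        if safe_capacity is not None:      # configured in the resource template
            return safe_capacity
        # Dynamic safe capacity: available capacity divided by client count.
        return available / num_clients
    raise ValueError(f"unknown mode: {mode}")
```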
Top and Bottom Halves
The client library will consist of two layers with an intermediate data structure. The top half gets called by the client code, whereas the bottom half communicates with the rate limiting system. The client state lives in the grey zone between the two layers, and gets updated by both halves:
Note: This design is based on https://github.com/youtube/doorman/blob/master/doc/design.md
Using Quota server
The app server can make an RPC call to the rate limiter to check the quota for a given UserId or CustomerId.
A sliding window counter or token bucket design is best suited to implement the rate limiter as a server.
Local vs global rate limiter
The rate limiter server can be implemented as a two-tier system: a global server responsible for enforcing long-term quota, and a local rate limiter responsible for short-term quota.
The main role of the local rate limiter is to be as close as possible to the backend that needs to enforce the quota limit. This is intended to minimize the latency between the backend and the quota server.
Local Rate limiter
A local rate limiter is deployed in the same zone/cluster as the application. It helps minimize the latency between the application and the rate limiter.
Customer Quota based design
A quota for each customer can be used to control the traffic. This can be implemented on either the server side or the client side.
Server side Quota approach
- Enforcing customer quotas on the server.
- Each server allocates rate quotas to its customers to ensure customer isolation (i.e., fair resource usage).
- Customers with no explicit quota get a default quota limit.
- Server operators are responsible for both scaling the server resources and managing the per-customer quota configuration.
Client Side Quota approach
- Client-side quota refers to enforcing customer quotas on the client.
- Conceptually, each customer is told what quota it should respect.
- The clients owned by the customer are responsible for enforcing quota.
Hard limit vs Soft limit
Instead of rejecting requests as soon as the quota is exceeded, we can have two levels of limits:
- Soft limit
- Hard limit
Once a customer's quota reaches the soft limit, the server starts sending warnings to the client so that the client can take appropriate action to reduce its traffic.
If the client cannot control the traffic and it exceeds the hard limit, the server rejects the requests.
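The two-level check can be sketched as follows (Python; the return values are illustrative names):

```python
def check_quota(usage: float, soft_limit: float, hard_limit: float) -> str:
    """Two-level quota check: warn between the soft and hard limits,
    reject only past the hard limit."""
    if usage < soft_limit:
        return "ALLOW"
    if usage < hard_limit:
        return "ALLOW_WITH_WARNING"        # server warns; client should back off
    return "REJECT"
```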
Happy system designing :-)