APIs are a great way for software applications to communicate with each other, allowing them to interact and share resources or privileges.

Today, many B2B companies offer their services via APIs that can be consumed by apps written in any programming language or framework. However, this leaves them vulnerable to DoS and DDoS attacks and can also lead to an uneven distribution of bandwidth between users. To tackle these issues, a technique known as API rate limiting is implemented. The idea is simple: you limit the number of requests that users can make to your API.

In this guide, you will learn what API rate limiting is, the multiple ways it can be implemented, and a few best practices and examples to remember when setting up API rate limits.

What Is API Rate Limiting?

In simple words, API rate limiting refers to setting a threshold, or limit, on the number of times users can access an API. The limits can be decided in multiple ways.

1. User-based Limits

One way to set a rate limit is to restrict the number of times a particular user can access the API in a given timeframe. This can be achieved by counting the number of requests made using the same API key or IP address and, once a threshold is reached, throttling or denying further requests.
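
As a rough sketch, here's what a simple per-key counter could look like in an Express app, keyed by an assumed api-token header and falling back to the client's IP address; the window and limit values are arbitrary example numbers:

const express = require('express');
const app = express();

const WINDOW_MS = 60 * 1000; // 1-minute window (example value)
const MAX_REQUESTS = 100;    // allowed requests per key per window (example value)

// In-memory counters keyed by API key or IP address
const counters = new Map();

app.use((req, res, next) => {
    const key = req.headers['api-token'] || req.ip;
    const now = Date.now();
    const entry = counters.get(key) || { count: 0, windowStart: now };

    // Reset the counter once the window has elapsed
    if (now - entry.windowStart > WINDOW_MS) {
        entry.count = 0;
        entry.windowStart = now;
    }

    entry.count++;
    counters.set(key, entry);

    if (entry.count > MAX_REQUESTS) {
        return res.status(429).json({ error: "Too Many Requests" });
    }
    next();
});

app.get('/route', (req, res) => res.json({ message: "Processed!" }));

app.listen(3000);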

2. Location-based Limits

In many cases, developers want to distribute the available bandwidth for their API equally among certain geographic locations.

The recent ChatGPT preview service is a good example of location-based rate limiting: once the paid version rolled out, requests to the free version started being limited based on user location. This made sense, since the free preview was meant to be used by people worldwide to generate a representative sample of usage data for the service.

3. Server-based Limits

Server-based rate limiting is an internal rate limit implemented on the server side to ensure equitable distribution of server resources such as CPU, memory, disk space, etc. It is done by implementing a limit on each server of a deployment.

When a server reaches its limit, further incoming requests are routed to another server with available capacity. If all servers have reached capacity, the user receives a 429 Too Many Requests response. It is important to note that server-based rate limits are applied to all clients irrespective of their geographical location, time of access, or other factors.

Types of API Rate Limits

Apart from how rate limits are implemented, you can also classify them by their effect on the end user. Some common types are:

  • Hard limits: These are strict limits that, when crossed, will completely restrict the user from accessing the resource until the limit is lifted.
  • Soft limits: These are flexible limits that, when crossed, might still allow the user to access the resource a few more times (or throttle the requests) before cutting off access.
  • Dynamic limits: These limits depend on multiple factors such as server load, network traffic, user location, user activity, traffic distribution, etc., and are adjusted in real time to keep resources functioning efficiently.
  • Throttles: These limits do not cut off access to the resource but rather slow down or queue further incoming requests until the limit is lifted.
  • Billable limits: These limits do not restrict access or throttle speed but instead charge the user for further requests when the set free threshold is exceeded.

Why Is Rate Limiting Necessary?

There are multiple reasons why you’d need to implement rate limiting in your web APIs. Some of the top reasons are:

1. Protecting Resource Access

The first reason to consider implementing an API rate limit in your app is to protect your resources from being overexploited by users with malicious intent. Attackers can use techniques like DDoS attacks to monopolize access to your resources and prevent your app from functioning normally for other users. Having a rate limit in place ensures that you are not making it easy for attackers to disrupt your APIs.

2. Splitting Quota Among Users

Apart from protecting your resources, the rate limit allows you to split your API resources among users. This means that you can create tiered pricing models and cater to the dynamic needs of your customers without letting them affect other customers.

3. Enhancing Cost-efficiency

Rate limiting also translates to cost limiting, as it lets you distribute your resources judiciously among your users. With a partitioned structure, it is easier to estimate the cost of the system's upkeep, and any spikes can be handled intelligently by provisioning or decommissioning the right amount of resources.

4. Managing Flow Between Workers

Many APIs rely on a distributed architecture that uses multiple workers/threads/instances to handle incoming requests. In such a structure, you can use rate limits to control the workload passed to each worker node. This can help you ensure that the worker nodes receive equitable and sustainable workloads. You can easily add or remove workers as and when needed without restructuring the entire API gateway.

Understanding Burst Limits

Another common way of controlling API usage is to set a burst limit (also known as throttling) instead of a rate limit. Burst limits are rate limits implemented over a very small time interval, say a few seconds. For instance, instead of setting up a limit of roughly 13 million requests per month, you could set a limit of 5 requests per second. While this equates to about the same monthly traffic, it ensures that your customers don't overload your servers by sending bursts of thousands of requests at once.

In the case of burst limits, requests are often delayed until the next interval instead of denied. It is also often recommended to use both rate and burst limits together for optimum traffic and usage control.

3 Methods of Implementing Rate Limiting

When it comes to implementation, there are a few methods you can use to set up API rate limiting in your app. They include:

1. Request Queues

One of the simplest practical methods of restricting API access is via request queues. Request queues refer to a mechanism in which incoming requests are stored in the form of a queue and processed one after another up to a certain limit.

A common use case of request queues is segregating incoming requests from free and paid users. Here’s how you can do that in an Express app using the express-queue package:

const express = require('express')
const expressQueue = require('express-queue');

const app = express()

const freeRequestsQueue = expressQueue({
    activeLimit: 1, // Maximum requests to process at once
    queuedLimit: -1 // Maximum requests allowed in queue (-1 means unlimited)
});

const paidRequestsQueue = expressQueue({
    activeLimit: 5, // Maximum requests to process at once
    queuedLimit: -1 // Maximum requests allowed in queue (-1 means unlimited)
});

// Middleware that selects the appropriate queue handler based on the presence of an API token in the request
function queueHandlerMiddleware(req, res, next) {
    // Check if the request contains an API token
    const apiToken = req.headers['api-token'];

    if (apiToken && isValidToken(apiToken)) {
        console.log("Paid request received")
        paidRequestsQueue(req, res, next);
    } else {
        console.log("Free request received")
        freeRequestsQueue(req, res, next);
     }
}

// Add the custom middleware function to the route
app.get('/route', queueHandlerMiddleware, (req, res) => {
    res.status(200).json({ message: "Processed!" })
});

// Check whether the API token is valid (always returns true in this example)
const isValidToken = (apiToken) => {
    return true;
}

app.listen(3000);

2. Throttling

Throttling is another technique used to control access to APIs. Instead of cutting off access after a threshold is reached, throttling focuses on evening out the spikes in API traffic by implementing small thresholds for small time ranges. Instead of setting up a rate limit like 3 million calls per month, throttling sets up limits of 10 calls per second. Once a client sends more than 10 calls in a second, the next requests in the same second are automatically throttled, but the client instantly regains access to the API in the next second.

You can implement throttling in Express using the express-throttle package. Here’s a sample Express app that shows how to set up throttling in your app:

const express = require('express')
const throttle = require('express-throttle')

const app = express()

const throttleOptions = {
    "rate": "10/s",
    "burst": 5,
    "on_allowed": function (req, res, next, bucket) {
        res.set("X-Rate-Limit-Limit", 10);
        res.set("X-Rate-Limit-Remaining", bucket.tokens);
        next()
    },
    "on_throttled": function (req, res, next, bucket) {
        // Notify client
        res.set("X-Rate-Limit-Limit", 10);
        res.set("X-Rate-Limit-Remaining", 0);
        res.status(503).send("System overloaded, try again after a few seconds.");
    }
}

// Add the custom middleware function to the route
app.get('/route', throttle(throttleOptions), (req, res) => {
    res.status(200).json({ message: "Processed!" })
});

app.listen(3000);

You can test the app using a load-testing tool like AutoCannon. You can install AutoCannon by running the following command in your terminal:

npm install autocannon -g

You can test the app using the following:

autocannon http://localhost:3000/route

The test uses 10 concurrent connections that send in requests to the API. Here’s the result of the test:

Running 10s test @ http://localhost:3000/route

10 connections

┌─────────┬──────┬──────┬───────┬──────┬─────────┬─────────┬───────┐
│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%  │ Avg     │ Stdev   │ Max   │
├─────────┼──────┼──────┼───────┼──────┼─────────┼─────────┼───────┤
│ Latency │ 0 ms │ 0 ms │ 1 ms  │ 1 ms │ 0.04 ms │ 0.24 ms │ 17 ms │
└─────────┴──────┴──────┴───────┴──────┴─────────┴─────────┴───────┘
┌───────────┬─────────┬─────────┬────────┬─────────┬────────┬─────────┬─────────┐
│ Stat      │ 1%      │ 2.5%    │ 50%    │ 97.5%   │ Avg    │ Stdev   │ Min     │
├───────────┼─────────┼─────────┼────────┼─────────┼────────┼─────────┼─────────┤
│ Req/Sec   │ 16591   │ 16591   │ 19695  │ 19903   │ 19144  │ 1044.15 │ 16587   │
├───────────┼─────────┼─────────┼────────┼─────────┼────────┼─────────┼─────────┤
│ Bytes/Sec │ 5.73 MB │ 5.73 MB │ 6.8 MB │ 6.86 MB │ 6.6 MB │ 360 kB  │ 5.72 MB │
└───────────┴─────────┴─────────┴────────┴─────────┴────────┴─────────┴─────────┘

Req/Bytes counts sampled once per second.
# of samples: 11
114 2xx responses, 210455 non 2xx responses
211k requests in 11.01s, 72.6 MB read

Since only 10 requests per second were allowed (with an extra burst of 5 requests), only 114 requests were successfully processed by the API; the remaining requests received a 503 error code asking the client to try again after a few seconds.

3. Rate-limiting Algorithms

While rate limiting looks like a simple concept that could be implemented with a queue, it can, in fact, be implemented in multiple ways, each offering different benefits. Here are a few popular algorithms used to implement rate limiting:

Fixed Window Algorithm

The fixed window algorithm is one of the simplest rate-limiting algorithms. It limits the number of requests that can be handled in a fixed time interval.

You set a fixed number of requests, say 100, that can be handled by the API server in an hour. Now, when the 101st request arrives, the algorithm denies processing it. When the time interval resets (i.e., in the next hour), another 100 incoming requests can be processed.

This algorithm is straightforward to implement and works well in many cases where server-side rate limiting is needed to control bandwidth (in contrast to distributing bandwidth among users). However, it can result in spiky traffic/processing towards the edges of the fixed time interval. The sliding window algorithm is a better alternative in cases where you need even processing.
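
As a rough illustration, here's what a minimal fixed window counter might look like in plain JavaScript; the limit of 100 requests per hour mirrors the example above:

// Minimal fixed window rate limiter (illustrative sketch)
function createFixedWindowLimiter(limit, windowMs) {
    let windowStart = Date.now();
    let count = 0;

    return function allowRequest() {
        const now = Date.now();
        // Start a new window once the current one expires
        if (now - windowStart >= windowMs) {
            windowStart = now;
            count = 0;
        }
        if (count < limit) {
            count++;
            return true;  // request can be processed
        }
        return false;     // limit reached, deny until the window resets
    };
}

// 100 requests per hour, as in the example above
const allowRequest = createFixedWindowLimiter(100, 60 * 60 * 1000);
console.log(allowRequest()); // true until the 101st request in the hour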

Sliding Window Algorithm

The sliding window algorithm is a variation of the fixed window algorithm. Instead of using fixed, predefined time intervals, this algorithm uses a rolling time window to track the number of processed and incoming requests.

Instead of looking at absolute time intervals (of, say, 60 seconds each), such as 0s to 60s, 61s to 120s, and so on, the sliding window algorithm looks at the previous 60s from when a request is received. Say a request arrives at the 82nd second; the algorithm then counts the number of requests processed between 22s and 82s (instead of the absolute interval 60s to 120s) to determine whether this request can be processed. This prevents situations in which a large number of requests are processed at both the 59th and 61st seconds, overloading the server for a very short period.

This algorithm handles burst traffic more gracefully but can be more difficult to implement and maintain than the fixed window algorithm.
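
To make the difference concrete, here is a minimal sketch of a log-based sliding window limiter that stores a timestamp per request and only counts requests inside the rolling window; the limit and window size are arbitrary examples:

// Minimal sliding window (log-based) rate limiter (illustrative sketch)
function createSlidingWindowLimiter(limit, windowMs) {
    const timestamps = [];

    return function allowRequest() {
        const now = Date.now();
        // Drop timestamps that have fallen out of the rolling window
        while (timestamps.length && now - timestamps[0] >= windowMs) {
            timestamps.shift();
        }
        if (timestamps.length < limit) {
            timestamps.push(now);
            return true;
        }
        return false;
    };
}

// e.g., at most 100 requests in any rolling 60-second window
const allowRequest = createSlidingWindowLimiter(100, 60 * 1000);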

Token Bucket Algorithm

In this algorithm, a fictional bucket is filled with tokens, and whenever the server processes a request, a token is taken out of the bucket. When the bucket is empty, no more requests can be processed by the server. Further requests are either delayed or denied until the bucket is refilled.

The token bucket is refilled at a fixed rate (known as token generation rate), and the maximum number of tokens that can be stored in the bucket is also fixed (known as bucket depth).

By controlling the token generation rate and the depth of the bucket, you can control the maximum rate of traffic flow allowed by the API. The express-throttle package you saw earlier uses the token bucket algorithm to throttle or control the flow of API traffic.

The biggest benefit of this algorithm is that it supports burst traffic as long as it can be accommodated in the bucket depth. This is especially useful for unpredictable traffic.
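
Here is a minimal token bucket sketch in plain JavaScript; the bucket depth of 5 and refill rate of 10 tokens per second are arbitrary example values (they roughly mirror the express-throttle configuration shown earlier):

// Minimal token bucket rate limiter (illustrative sketch)
function createTokenBucket(bucketDepth, tokensPerSecond) {
    let tokens = bucketDepth;
    let lastRefill = Date.now();

    return function allowRequest() {
        const now = Date.now();
        // Refill tokens based on the time elapsed since the last check
        const elapsedSeconds = (now - lastRefill) / 1000;
        tokens = Math.min(bucketDepth, tokens + elapsedSeconds * tokensPerSecond);
        lastRefill = now;

        if (tokens >= 1) {
            tokens -= 1;  // spend one token per request
            return true;
        }
        return false;     // bucket empty, delay or deny the request
    };
}

// Allows bursts of up to 5 requests, refilling at 10 tokens per second
const allowRequest = createTokenBucket(5, 10);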

Leaky Bucket Algorithm

The leaky bucket algorithm is another algorithm for handling API traffic. Instead of maintaining a bucket depth that determines how many requests can be handled in a time frame (like in a token bucket), it allows a fixed flow of requests from the bucket, which is analogous to the steady flow of water from a leaky bucket.

The bucket depth, in this case, is used to determine how many requests can be queued to be processed before the bucket starts overflowing, i.e., denying incoming requests.

The leaky bucket promises a steady flow of requests and, unlike the token bucket, does not handle spikes in traffic.
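
A minimal sketch of the idea: requests join a queue up to the bucket depth, and a timer drains them at a fixed rate. The numbers here are arbitrary examples:

// Minimal leaky bucket rate limiter (illustrative sketch)
function createLeakyBucket(bucketDepth, leaksPerSecond, processRequest) {
    const queue = [];

    // Drain the bucket at a fixed rate, regardless of how requests arrive
    setInterval(() => {
        if (queue.length > 0) {
            processRequest(queue.shift());
        }
    }, 1000 / leaksPerSecond);

    return function addRequest(request) {
        if (queue.length >= bucketDepth) {
            return false;  // bucket overflowing, deny the request
        }
        queue.push(request);
        return true;       // queued, will be processed at the steady leak rate
    };
}

// Queue up to 20 requests and process 5 of them per second
const addRequest = createLeakyBucket(20, 5, (req) => console.log("Processing", req));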

Best Practices For API Rate Limiting

Now that you understand what API rate limiting is and how it is implemented, here are a few best practices you should consider when implementing it in your app.

Offer a Free Tier for Users To Explore Your Services

When considering implementing an API rate limit, always try to offer an adequate free tier that your prospective users can use to try out your API. It doesn't have to be very generous, but it should be enough to allow them to test your API comfortably while developing their app.

While API rate limits are vital to maintaining the quality of your API endpoints for your users, a small unthrottled free tier can help you gain new users.

Decide What Happens When Rate Limit Is Exceeded

When a user exceeds your set API rate limit, there are a couple of things you should take care of to ensure that you present a positive user experience while still protecting your resources. Some questions you should ask and considerations you must make are:

What Error Code and Message Will Your Users See?

The first thing you must take care of is informing your users that they have exceeded the set API rate limit. To do this, you need to change the API response to a preset message that explains the issue. It is important that the status code for this response be 429 “Too Many Requests.” It is also customary to explain the issue in the response body. Here’s what a sample response body could look like:

{
    "error": "Too Many Requests",
    "message": "You have exceeded the set API rate limit of X requests per minute. Please try again in a few minutes.",
    "retry_after": 60
}

The sample response body shown above mentions the error name and description and also specifies a duration (usually in seconds) after which the user can retry sending requests. A descriptive response body like this helps the users to understand what went wrong and why they did not receive the response they were expecting. It also lets them know how long to wait before sending another request.
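
For example, in an Express app, a middleware could send this response along with the standard Retry-After header. This is only a sketch; isRateLimited() is a hypothetical placeholder for whatever limit check your implementation uses:

function rateLimitExceededHandler(req, res, next) {
    // isRateLimited() is a hypothetical placeholder for your own limit check
    if (isRateLimited(req)) {
        res.set('Retry-After', '60'); // standard HTTP header, value in seconds
        return res.status(429).json({
            error: "Too Many Requests",
            message: "You have exceeded the set API rate limit of X requests per minute. Please try again in a few minutes.",
            retry_after: 60
        });
    }
    next();
}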

Will New Requests Be Throttled or Completely Stopped?

Another decision point is what to do after the set API rate limit is crossed by a user. Usually, you would limit the user from interacting with the server by sending back a 429 “Too Many Requests” response, as you saw above. However, you should also consider an alternate approach—throttling.

Instead of cutting off access to the server resource completely, you can instead slow down the total number of requests that the user can send in a timeframe. This is useful when you want to give your users a slap on the wrist but still allow them to continue working if they reduce their request volume.

Consider Caching and Circuit Breaking

API rate limits are unpleasant: they restrict your users from interacting with and using your API services. It is especially frustrating for users who need to make similar requests again and again, such as accessing a weather forecast dataset that gets updated only weekly or fetching a list of dropdown options that changes once in a blue moon. In these cases, an intelligent approach is to implement caching.

Caching is a high-speed storage abstraction implemented in cases where data access volume is high, but the data does not change very often. Instead of making an API call that might invoke multiple internal services and incur heavy expenses, you could cache the most frequently used endpoints so that the second request onwards is served from the static cache, which is usually faster, cheaper, and can reduce the workload from your main services.
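
As a rough sketch, a small in-memory cache in front of an expensive Express route could look like the following; the /forecast route is hypothetical, the one-week TTL mirrors the weekly dataset example above, and in production you would more likely use a dedicated cache such as Redis:

const express = require('express');
const app = express();

const cache = new Map();
const CACHE_TTL_MS = 7 * 24 * 60 * 60 * 1000; // one week, matching the weekly dataset example

function cacheMiddleware(req, res, next) {
    const cached = cache.get(req.originalUrl);
    // Serve the stored response while it is still fresh
    if (cached && Date.now() - cached.storedAt < CACHE_TTL_MS) {
        return res.json(cached.body);
    }
    // Otherwise intercept res.json to store the response before sending it
    const originalJson = res.json.bind(res);
    res.json = (body) => {
        cache.set(req.originalUrl, { body, storedAt: Date.now() });
        return originalJson(body);
    };
    next();
}

// Hypothetical endpoint that would normally invoke expensive internal services
app.get('/forecast', cacheMiddleware, (req, res) => {
    res.json({ forecast: "sunny" });
});

app.listen(3000);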

There can be another case where you receive an unusually high number of requests from a user. Even after setting a rate limit, they consistently reach their capacity and get rate limited. Such situations can indicate potential API abuse.

To protect your services from overloading and to maintain a uniform experience for the rest of your users, you should consider restricting the suspect user from the API completely. This is known as circuit breaking, and while it sounds similar to rate limiting, it is generally used when the system faces an overload of requests and needs time to slow down to regain its quality of service.

Monitor Your Setup Closely

While API rate limits are meant to distribute your resources equitably among your users, they can sometimes cause unnecessary trouble for your users, and frequent limit hits can even point to suspicious activity.

Setting up a robust monitoring solution for your API can help you understand how often your users hit their rate limits, whether you need to reconsider the general limits in light of your users' average workloads, and which users hit their limits frequently (which could mean they'll soon need a limit increase, or that they should be watched for suspicious activity). In any case, an active monitoring setup will help you better understand the impact of your API rate limits.

Implement Rate Limiting at Multiple Layers

Rate limiting can be implemented at multiple levels (user, application, or system). Many people make the mistake of setting up rate limits at just one of these levels and expecting it to cover all possible cases. While it is not exactly an anti-pattern, it can turn out to be ineffective in some cases.

If incoming requests overload your system's network interface, application-level rate limiting might not even get the chance to optimize workloads. Therefore, it's best to set up rate limit rules at more than one level, preferably at the topmost layers of your architecture, to ensure no bottlenecks are created.

Working With API Rate Limits

In this section, you will learn how to test the API rate limits for a given API endpoint and how to implement a usage control on your client to ensure you don’t end up exhausting your remote API limits.

How To Test API Rate Limits

To identify the rate limit for an API, your first approach should always be to read the API docs to see if the limits are clearly defined. In most cases, the API docs will tell you the limit and how it is implemented. You should resort to “testing” the API rate limit only when you cannot identify it from the API docs, support, or community. This is because testing an API to find its rate limit means you will end up exhausting your rate limit at least once, which might incur financial costs and/or API unavailability for a certain duration.

If you are looking to manually identify the rate limit, you should first begin with a simple API testing tool like Postman to make requests manually to the API and see if you can exhaust its rate limit. If you can’t, you can then use a load testing tool like Autocannon or Gatling to simulate a large number of requests and see how many requests are handled by the API before it starts responding with a 429 status code.

Another approach can be to use a rate limit checker tool like AppBrokers’ rate-limit-test-tool. Dedicated tools like this automate the process for you and also provide you with a user interface to analyze the test results carefully.

However, if you are not sure of an API’s rate limit, you can always try to estimate your request requirements and set up limits on your client side to ensure that the number of requests from your app doesn’t exceed that number. You’ll learn how to do that in the next section.

How To Throttle API Calls

If you are making calls to an API from your code, you may want to implement throttles on your side to ensure you don’t end up accidentally making too many calls to the API and exhausting your API limit. There are multiple ways to do this. One of the popular ways is to use the throttle method in the lodash utility library.

Before you start throttling an API call, you will need to create an API. Here's sample code for a Node.js-based API that prints the average number of requests it receives per second to the console:

const express = require('express');
const app = express();

// maintain a count of total requests
let requestTotalCount = 0;
let startTime = Date.now();

// increase the count whenever any request is received
app.use((req, res, next) => {
    requestTotalCount++;
    next();
});

// After each second, print the average number of requests received per second since the server was started
setInterval(() => {
    const elapsedTime = (Date.now() - startTime) / 1000;
    const averageRequestsPerSecond = requestTotalCount / elapsedTime;
    console.log(`Average requests per second: ${averageRequestsPerSecond.toFixed(2)}`);
}, 1000);

app.get('/', (req, res) => {
    res.send('Hello World!');
});

app.listen(3000, () => {
    console.log('Server listening on port 3000!');
});

Once this app runs, it will print the average number of requests received every second:

Average requests per second: 0.00
Average requests per second: 0.00
Average requests per second: 0.00

Next, create a new JavaScript file named test-throttle.js and save the following code in it:

// function that calls the API and prints the response
const request = () => {
    fetch('http://localhost:3000')
    .then(r => r.text())
    .then(r => console.log(r))
}

// Loop to call the request function once every 100 ms, i.e., 10 times per second
setInterval(request, 100)

Once you run this script, you will notice that the average number of requests for the server jumps up close to 10:

Average requests per second: 9.87
Average requests per second: 9.87
Average requests per second: 9.88

What if this API only allowed 6 requests per second, for instance? You’d want to keep your average requests count below that. However, if your client sends a request based on some user activity, such as the click of a button or a scroll, you might not be able to limit the number of times the API call is triggered.

The throttle() function from lodash can help here. First, install the library by running the following command:

npm install lodash

Next, update the test-throttle.js file to contain the following code:

// import the lodash library
const { throttle } = require('lodash');

// function that calls the API and prints the response
const request = () => {
    fetch('http://localhost:3000')
    .then(r => r.text())
    .then(r => console.log(r))
}

// create a throttled function that can only be called once every 200 ms, i.e., only 5 times every second
const throttledRequest = throttle(request, 200)

// loop this throttled function to be called once every 100 ms, i.e., 10 times every second
setInterval(throttledRequest, 100)

Now, if you look at the server logs, you’ll see a similar output:

Average requests per second: 4.74
Average requests per second: 4.80
Average requests per second: 4.83

This means that even though your app is calling the request function 10 times every second, the throttle function ensures that it gets called only 5 times a second, helping you stay under the rate limit. This is how you can set up client-side throttling to avoid exhausting API rate limits.

Common API Rate Limit Errors

When working with rate-limited APIs, you might encounter a variety of responses that indicate when a rate limit has been exceeded. In most cases, you will receive the status code 429 with a message similar to one of these:

  • Calls to this API have exceeded the rate limit
  • API rate limit exceeded
  • 429 too many requests

However, the message that you receive depends on the implementation of the API you’re using. This implementation can vary, and some APIs might not even use the 429 status code at all. Here are some other types of rate-limit error codes and messages you might receive when working with rate-limited APIs:

  • 403 Forbidden or 401 Unauthorized: Some APIs may start treating your requests as unauthorized, denying you access to the resource.
  • 503 Service Unavailable or 500 Internal Server Error: If an API is overloaded by incoming requests, it might start sending 5XX error messages indicating that the server is not healthy. This is usually temporary and fixed by the service provider in due time.
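
On the client side, a common pattern when you receive such a response is to check for a 429 status code and wait for the duration indicated by the Retry-After header (or a retry_after field in the body) before retrying. Here's a minimal sketch using fetch; the 60-second fallback is an arbitrary default:

// Fetch with a single retry on 429, honoring the Retry-After header if present
async function fetchWithRetry(url) {
    const response = await fetch(url);
    if (response.status === 429) {
        const retryAfterSeconds = Number(response.headers.get('Retry-After')) || 60;
        console.log(`Rate limited, retrying in ${retryAfterSeconds} seconds`);
        await new Promise((resolve) => setTimeout(resolve, retryAfterSeconds * 1000));
        return fetch(url);
    }
    return response;
}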

How Top API Providers Implement API Rate Limits

When setting the rate limit for your API, it can help to take a look at how some of the top API providers do it:

  • Discord: Discord implements rate limiting in two ways: a global rate limit of 50 requests per second and route-specific rate limits that you need to keep in mind. You can read all about it in this documentation. When the rate limit is exceeded, you will receive an HTTP 429 response with a retry_after value that you can use to wait before sending another request.
  • Twitter: Twitter also has route-specific rate limits that you can find in their documentation. Once the rate limit is exceeded, you will receive an HTTP 429 response with an x-rate-limit-reset header value that will let you know when you can resume access.
  • Reddit: Reddit’s archived API wiki states that the rate limit for accessing the Reddit API is 60 requests per minute (via OAuth2 only). The response to each Reddit API call returns values for the X-Ratelimit-Used, X-Ratelimit-Remaining, and X-Ratelimit-Reset headers, with which you can determine how close you are to the limit and how long to wait before it resets.
  • Facebook: Facebook also sets route-based rate limits. For instance, calls made from Facebook-based apps are limited to 200 * (number of app users) requests per hour. You can find the complete details here. Responses from the Facebook API contain an X-App-Usage or an X-Ad-Account-Usage header to help you understand when your usage will be throttled.

Summary

When building APIs, ensuring optimum traffic control is crucial. If you don’t keep a close eye on your traffic management, you will soon end up with an API that is overloaded and non-functional. Conversely, when working with a rate-limited API, it is important to understand how rate-limiting works and how you should use the API to ensure maximum availability and usage.

In this guide, you learned about API rate limiting, why it is necessary, how it can be implemented, and some best practices you should keep in mind when working with API rate limits.

Check out Kinsta’s Application Hosting and spin up your next Node.js project today!

Are you working with a rate-limited API? Or have you implemented rate limiting in your own API? Let us know in the comments below!