Protecting Performance of Production Machine Learning APIs with Effective Rate Limiting
Introduction
Deploying machine learning (ML) models, including large language models (LLMs), to production is a crucial step in making them accessible and useful in real-world scenarios. However, deploying a model isn't just about making it available; it also involves ensuring it operates efficiently, remains available, and is protected from misuse or overloading.
Once your machine learning model is trained and ready to go, you need to decide how to deploy it. There are several common approaches, each with its own advantages:
In this article, we'll focus on the first of these approaches: ML models that have been productionised and made available as an API. Deploying machine learning models as APIs is a common and effective approach in production environments. When a model is available as an API, you can call it, provide input data, and receive output as needed. To ensure the API remains reliable and provides fair service to all clients, it's crucial to manage the number of requests it receives and prevent it from becoming overwhelmed. This can be achieved using an API rate limiter.
We’ll explain how to design and develop a rate limiter to protect any API-based system, including those serving machine learning models. Using a sample project, we’ll cover the process of implementing rate limiting, setting up the API, and conducting unit tests.
Real-world Examples of Machine Learning as a Service (MLaaS) APIs
MLaaS APIs provide powerful, ready-to-use machine learning capabilities for image analysis, text processing, and more. They allow you to integrate sophisticated ML models into your applications easily, without needing to manage the infrastructure or model training directly. Popular examples include Google Cloud Vision (image labelling and OCR), Amazon Rekognition (image and video analysis), Azure Cognitive Services (speech and language), and the OpenAI API (text generation and embeddings).
What is an API Rate Limiter?
An API rate limiter is a mechanism to control the number of requests a client can make to an API within a specific time period (e.g., 100 requests per minute). It ensures that all clients receive a fair share of resources and prevents any single client from overwhelming the system. Also, it protects backend systems from being overloaded and maintains performance stability. Rate limiting is especially important in high-traffic environments and helps maintain the performance and reliability of the system.
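To make the idea concrete, here is a minimal in-memory sketch of a fixed-window rate limiter. The class name and structure are illustrative only; a production deployment would use a shared store such as Redis, as discussed later in this article.

```python
import time

class FixedWindowLimiter:
    """Allow at most `limit` requests per `window` seconds, per client.

    A minimal in-memory sketch of the concept; not suitable for
    multi-process deployments, where counters must live in shared storage.
    """

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.counters = {}  # client_id -> (window_start, request_count)

    def allow(self, client_id: str) -> bool:
        now = time.time()
        start, count = self.counters.get(client_id, (now, 0))
        if now - start >= self.window:       # window expired: reset it
            start, count = now, 0
        if count >= self.limit:              # over the limit: reject
            self.counters[client_id] = (start, count)
            return False
        self.counters[client_id] = (start, count + 1)
        return True

limiter = FixedWindowLimiter(limit=3, window=60)
print([limiter.allow("alice") for _ in range(5)])
# → [True, True, True, False, False]
```

Each client gets its own counter, so one noisy client being throttled never affects the others.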
Three Approaches to Design API Rate Limiter
Three common approaches to designing the rate limiter are explained in the table and image below. Among them, the third approach (API Gateway Rate Limiter) is generally preferred due to its centralised control, low latency impact, and efficient use of caching for data storage. This method simplifies scaling and provides consistent rate limiting while minimising added complexity and latency. In addition, it requires no extra network call, since every request already passes through the API gateway for security checks.
To implement the rate limiter logic, we need (1) the client's user ID or IP address to identify them, (2) the number of requests allowed in a specific period of time, and (3) the timestamp of each client's latest request. Ideally, the storage should (1) hold recent data only and (2) provide very fast access, so that memory usage stays low and we don't retain unnecessary data indefinitely. There are two common storage options: a relational database such as MySQL, and an in-memory cache. A MySQL database is not memory efficient, since it stores one record for every request; the data grows without bound as clients send more requests, retaining records we no longer need. Moreover, reading this data from disk and running aggregation queries over it is slow, which adds latency to every request. A cache, on the other hand, stores data temporarily and provides very quick access, because it resides in memory rather than on disk.
Redis is a popular in-memory cache that (1) stores data in memory, (2) provides very quick access to data, (3) supports time to live (TTL) to retain data only for a specific period, and (4) minimises computation time by offering atomic operations such as increment, decrement, and counters, which is all the arithmetic a rate limiter needs.
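The pattern described above boils down to two Redis commands: `INCR` to count a client's requests, and `EXPIRE` to let the counter vanish after the window passes. The sketch below shows this pattern; since it needs to run without a live server here, a tiny in-memory stand-in plays the role of the Redis client (in production you would pass a `redis.Redis()` instance from redis-py instead, which exposes the same `incr`/`expire` methods). The key format and parameter names are illustrative, not taken from the article's repository.

```python
import time

def is_allowed(redis_client, client_id: str, limit: int = 5, window: int = 20) -> bool:
    """Fixed-window check using the Redis INCR + EXPIRE pattern."""
    key = f"rate:{client_id}"
    count = redis_client.incr(key)        # atomically increment this client's counter
    if count == 1:                        # first request of a fresh window:
        redis_client.expire(key, window)  # start the TTL clock on the counter
    return count <= limit

# Minimal in-memory stand-in for a Redis client, so the sketch runs
# without a server; in production use redis.Redis() from redis-py.
class FakeRedis:
    def __init__(self):
        self.store = {}  # key -> [count, expiry_timestamp]

    def incr(self, key):
        entry = self.store.get(key)
        if entry is None or time.time() >= entry[1]:
            entry = [0, float("inf")]     # no TTL until expire() is called
            self.store[key] = entry
        entry[0] += 1
        return entry[0]

    def expire(self, key, seconds):
        self.store[key][1] = time.time() + seconds

r = FakeRedis()
print([is_allowed(r, "alice", limit=2, window=20) for _ in range(4)])
# → [True, True, False, False]
```

Because `INCR` is atomic on the Redis server, this check stays correct even when many API gateway workers handle requests for the same client concurrently.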
API Gateway Rate Limiter with Data Storage (Redis as in-Memory CACHE)
When the client makes a request, the rate limiter receives it and, based on the data in the cache, decides whether the client should be rate limited. If so, it returns HTTP status code 429 to the client, so the client knows it has been rate limited; otherwise, it forwards the request to the API server, and the API server sends the response back to the client.
The rate limiter should give the client enough feedback about why they have been rate limited and when they can retry. In this work, we use the appropriate status code (429, which represents "too many requests") to notify clients that they have exceeded the request limit, along with the time at which they can try again (e.g. 'Client exceeded sending requests. Please try again after 10 seconds'). When a request is successful, we also tell the client how many more requests they can send within the current window (e.g. 'Request was successful. You can send 5 more requests in the next 20 seconds').
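The feedback scheme above can be sketched as a small helper that shapes the HTTP response. The header names follow the common `Retry-After` / `X-RateLimit-Remaining` conventions; the exact headers and message wording in the article's repository may differ.

```python
def build_response(allowed: bool, remaining: int, retry_after: int, window: int) -> dict:
    """Build an HTTP-style response (status, headers, body) carrying
    the rate-limiter feedback described in the article."""
    if allowed:
        return {
            "status": 200,
            "headers": {"X-RateLimit-Remaining": str(remaining)},
            "body": (f"Request was successful. You can send {remaining} "
                     f"more requests in the next {window} seconds."),
        }
    return {
        "status": 429,  # Too Many Requests
        "headers": {"Retry-After": str(retry_after)},
        "body": (f"Client exceeded sending requests. "
                 f"Please try again after {retry_after} seconds."),
    }

print(build_response(False, 0, 10, 20)["status"])  # → 429
```

Returning `Retry-After` lets well-behaved clients back off for exactly the right duration instead of retrying blindly.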
Python Implementation
The code for the above design of the API rate limiter has been developed in Python. The code structure is as follows:
/rate_limiter_project
├── requirements.txt
├── rate_limiter.py
├── api_gateway.py
├── test_rate_limiter.py
Each part of the code is explained below. Also, please check out the end-to-end code for this project at this Git repository.
requirements.txt
Specifies the dependencies and Python packages used in this work.
rate_limiter.py
Implements the core rate limiting logic.
api_gateway.py
Sets up the API endpoint that utilises the rate limiter.
test_rate_limiter.py
Contains tests that validate the functionality of the rate limiter.
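As an illustration of what such tests might look like, here is a self-contained `unittest` sketch. The `Limiter` stand-in is defined inline so the example runs on its own; the repository's `test_rate_limiter.py` would instead import the real class from `rate_limiter.py`, and its actual test cases may differ.

```python
import unittest

# Stand-in limiter so this sketch is self-contained; the real tests
# would import the rate limiter from rate_limiter.py instead.
class Limiter:
    def __init__(self, limit: int):
        self.limit, self.count = limit, 0

    def allow(self, client_id: str) -> bool:
        self.count += 1
        return self.count <= self.limit

class TestRateLimiter(unittest.TestCase):
    def test_allows_up_to_limit(self):
        limiter = Limiter(limit=3)
        self.assertTrue(all(limiter.allow("alice") for _ in range(3)))

    def test_rejects_over_limit(self):
        limiter = Limiter(limit=3)
        for _ in range(3):
            limiter.allow("alice")
        self.assertFalse(limiter.allow("alice"))
```

Run the suite with `python -m unittest test_rate_limiter` from the project root.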
Outputs
Redis Server is up and running
Redis is used to store the cache and perform the computations behind rate-limiting decisions
A demo API has been created using Flask, which is up and running
This App serves as an API gateway with rate limiting functionality
Output of unit test
If you have any questions about setting up the project, replicating the code, or running it, please don’t hesitate to reach out. I’m more than happy to help with any issues you encounter or provide additional guidance.