Protecting Performance of Production Machine Learning APIs with Effective Rate Limiting


Introduction

Deploying machine learning (ML) models, including large language models (LLMs), to production is a crucial step in making them accessible and useful in real-world scenarios. However, deploying a model isn't just about making it available; it also involves ensuring it operates efficiently, remains available, and is protected from misuse or overloading.

Once your machine learning model is trained and ready to go, you need to decide how to deploy it. There are several common approaches, each with its own advantages:

  • Real-time APIs: Perfect for applications that need instant feedback, like chatbots or real-time recommendations. This approach allows users to interact with your model on-the-fly.
  • Batch Processing: Ideal for scenarios where you need to process large volumes of data all at once. This is often used for offline analysis or generating reports.
  • Edge Deployment: Deploying models directly on devices or local servers can reduce latency and address data privacy concerns, as the data doesn’t need to be sent to a central server.
  • Serverless Functions: Cloud-based serverless platforms can execute model inference without the need to manage infrastructure, automatically scaling based on demand.
  • Containerisation: Using containers like Docker helps manage and deploy your model consistently across different environments. Tools like Kubernetes can help scale and orchestrate these containers.

In this article, we'll focus on the first approach: ML models that have been productionised and made available as an API. Deploying machine learning models as APIs is a common and effective approach in production environments. When a model is available as an API, you can call it, provide input data, and receive output as needed. To ensure the API remains reliable and provides fair service to all clients, it’s crucial to manage the number of requests it receives and prevent it from becoming overwhelmed. This can be achieved using an API rate limiter.

We’ll explain how to design and develop a rate limiter to protect any API-based system, including those serving machine learning models. Using a sample project, we’ll cover the process of implementing rate limiting, setting up the API, and conducting unit tests.


Real-world Examples of Machine Learning as a Service (MLaaS) APIs

MLaaS APIs provide powerful, ready-to-use machine learning capabilities for image analysis, text processing, and more. They allow you to integrate sophisticated ML models into your applications easily without needing to manage the infrastructure or model training directly. Here are some popular MLaaS APIs, including examples of their functionalities and how you might use them:

  • Google Cloud Vision API: Analyses the content of images to detect objects, text, labels, and more.
  • Google Cloud Natural Language API: Analyses text for sentiment, entities, and syntax.
  • Azure Computer Vision API: Extracts information from images and videos.
  • Azure Text Analytics API: Analyses text for sentiment, entities, and key phrases.
  • IBM Watson Visual Recognition: Analyses images to detect and classify objects, scenes, and faces.
  • IBM Watson Natural Language Understanding: Analyses text for sentiment, emotion, and entities.

 

What is an API Rate Limiter?

An API rate limiter is a mechanism to control the number of requests a client can make to an API within a specific time period (e.g., 100 requests per minute). It ensures that all clients receive a fair share of resources and prevents any single client from overwhelming the system. Also, it protects backend systems from being overloaded and maintains performance stability. Rate limiting is especially important in high-traffic environments and helps maintain the performance and reliability of the system.


Three Approaches to Designing an API Rate Limiter

Three common approaches to designing the rate limiter are summarised in the table and image below. Among them, the third approach (API Gateway Rate Limiter) is generally preferred due to its centralised control, low latency impact, and efficient use of caching for data storage. This method simplifies scaling and provides consistent rate limiting while minimising added complexity and latency. In addition, it requires no extra network call, since every request already passes through the API gateway for security checks.


[Image: Three different approaches to designing an API rate limiter]


[Image: Comparing the three rate limiter designs]


In order to implement the rate limiter logic, we need (1) the client’s user ID or IP address to identify them, (2) the number of requests allowed within a specific period of time, and (3) the timestamp of each client’s most recent request. Ideally, this data should live in storage that (1) holds only recent data temporarily and (2) offers very fast access, so that memory usage stays low and we don’t retain unnecessary data indefinitely. There are two common storage options: a database such as MySQL, and a cache. A MySQL database is not memory efficient because it stores one record per request, so the data grows without bound as clients send more requests, most of which we will never need again after some time. Moreover, reading that data from disk and then running computations and aggregation queries over it is slow and adds latency. A cache, on the other hand, stores data temporarily and provides very fast access, because it resides in memory rather than on disk.

Redis is a popular in-memory data store that (1) keeps data in memory, (2) provides very fast access, (3) uses time to live (TTL) to retain data only for a specific period, and (4) minimises computation time, since it offers atomic operations such as increment, decrement, and counters, which is all the arithmetic a rate limiter needs.
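To make this concrete, below is a minimal sketch of a fixed-window check built on exactly these primitives, assuming the redis Python package and a Redis server running locally; the key name, window length, and request limit are illustrative values rather than settings taken from this project.

import redis

# Assumes a local Redis server and the redis Python package.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

WINDOW_SECONDS = 20  # length of the rate-limit window (illustrative)
MAX_REQUESTS = 5     # requests allowed per window (illustrative)

def is_allowed(client_id: str) -> bool:
    # Count requests per client; the counter expires together with the window.
    key = f"rate:{client_id}"
    current = r.incr(key)  # atomic increment; creates the key at 1 if missing
    if current == 1:
        r.expire(key, WINDOW_SECONDS)  # start the window on the first request
    return current <= MAX_REQUESTS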


API Gateway Rate Limiter with Data Storage (Redis as In-Memory Cache)

When the client makes a request, the rate limiter receives it and, based on the data in the cache, decides whether the client should be rate limited. If so, it returns HTTP status code 429 so the client knows they have been rate limited; otherwise, it forwards the request to the API server, and the API server sends the response back to the client.

[Image: API rate limiter with Redis as an in-memory cache]


The rate limiter should give the client enough feedback about why they have been rate limited and when they can retry. In this work, we use status code 429 (Too Many Requests) to notify clients that they have exceeded the allowed number of requests, together with the time after which they can retry (e.g. ‘Client exceeded sending requests. Please try again after 10 seconds’). When a request is successful, we tell the client how many more requests they can send within the remaining window (e.g. ‘Request was successful. You can send 5 more requests in the next 20 seconds’).


Python Implementation

The code for the above API rate limiter design has been developed in Python. The project structure is as follows:

 /rate_limiter_project
    ├── requirements.txt
    ├── rate_limiter.py
    ├── api_gateway.py
    └── test_rate_limiter.py

Each part of the code is explained below. Please also check out the end-to-end code for this project in this Git repository.


requirements.txt

Specifies the dependencies and Python packages used in this work.

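As a rough sketch, based on the tools discussed in this article, the dependency list would include at least the web framework, the Redis client, and a test runner (the exact package set and versions here are assumptions):

flask
redis
pytest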

rate_limiter.py

Implements the core rate limiting logic.

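As a minimal sketch of what this logic could look like, here is a fixed-window limiter built on Redis; the class and parameter names (RateLimiter, max_requests, window_seconds) and the default values are illustrative assumptions rather than the exact implementation.

import redis


class RateLimiter:
    """Fixed-window rate limiter backed by Redis (illustrative sketch)."""

    def __init__(self, redis_client: redis.Redis, max_requests: int = 5, window_seconds: int = 20):
        self.redis = redis_client
        self.max_requests = max_requests
        self.window_seconds = window_seconds

    def check(self, client_id: str):
        """Return (allowed, remaining, retry_after_seconds) for this client."""
        key = f"rate_limit:{client_id}"
        count = self.redis.incr(key)  # atomic per-client counter
        if count == 1:
            self.redis.expire(key, self.window_seconds)  # open a new window
        ttl = self.redis.ttl(key)  # seconds until the window resets
        if count > self.max_requests:
            return False, 0, ttl
        return True, self.max_requests - count, ttl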

api_gateway.py

Sets up the API endpoint that utilises the rate limiter.

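A minimal sketch of the gateway using Flask is shown below; the endpoint name /predict, the X-User-Id header, and the response wording are assumptions used for illustration, building on the RateLimiter sketch above.

from flask import Flask, jsonify, request
import redis

from rate_limiter import RateLimiter

app = Flask(__name__)
limiter = RateLimiter(redis.Redis(host="localhost", port=6379, decode_responses=True))


@app.route("/predict", methods=["POST"])
def predict():
    # Identify the client by a user ID header if present, otherwise by IP address.
    client_id = request.headers.get("X-User-Id", request.remote_addr)
    allowed, remaining, retry_after = limiter.check(client_id)

    if not allowed:
        # 429 Too Many Requests, with a hint about when the client can retry.
        message = f"Client exceeded sending requests. Please try again after {retry_after} seconds."
        return jsonify({"message": message}), 429

    # In a real deployment, the request would be forwarded to the ML model here.
    message = f"Request was successful. You can send {remaining} more requests in the next {retry_after} seconds."
    return jsonify({"message": message}), 200


if __name__ == "__main__":
    app.run(port=5000)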

test_rate_limiter.py

Contains tests that validate the functionality of the rate limiter.

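As a sketch of what such tests could look like with pytest (assuming a local Redis server is running; the client ID and limits mirror the illustrative RateLimiter above):

import pytest
import redis

from rate_limiter import RateLimiter


@pytest.fixture
def limiter():
    # Assumes a local Redis server; clear the test client's counter before each test.
    client = redis.Redis(host="localhost", port=6379, decode_responses=True)
    client.delete("rate_limit:test-client")
    return RateLimiter(client, max_requests=3, window_seconds=5)


def test_requests_within_limit_are_allowed(limiter):
    for _ in range(3):
        allowed, _, _ = limiter.check("test-client")
        assert allowed


def test_request_over_limit_is_rejected(limiter):
    for _ in range(3):
        limiter.check("test-client")
    allowed, remaining, retry_after = limiter.check("test-client")
    assert not allowed
    assert remaining == 0
    assert retry_after <= 5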

Outputs

The Redis server is up and running

Redis is used as the cache and performs the computations behind rate-limiting decisions


The API, built with Flask, is up and running

This app serves as an API gateway with rate limiting functionality


Output of unit tests

  • If a client’s request to the API is allowed, the client can see how many more requests they can send and within what time period.
  • If a client has exceeded the allowed number of requests, they are informed when they will be allowed to send the next request to the API.


If you have any questions about setting up the project, replicating the code, or running it, please don’t hesitate to reach out. I’m more than happy to help with any issues you encounter or provide additional guidance.
