Leveraging GenServer and Queueing Techniques: Handling API Rate Limits for AI Inference Services

#elixir #api #serverless #ai

Managing the rate limits of external services is a recurring challenge in application development. I recently faced it while interfacing with the Fireworks serverless API. The Fireworks AI platform comes with a limit of 600 requests per minute, shared between its inference and embedding functionality. With the right approach, it's possible to make the most of this limit, accommodate multiple users, and keep response times consistent.

Fireworks provides a set of 2 API keys, which means you can go up to 1200 req/min if you load balance between them successfully. I wrote a service called Ping Pong to do that, but we won't be discussing load balancing here. Instead, we will go over the more exciting bit of Ping Pong: how to manage the rate limit with a GenServer and queue incoming requests, with an acceptable timeout, so that none of them are dropped.

In a typical scenario, users may submit many requests at once, and a burst can consume the available quota long before the minute is over. For instance, if 600 requests come in within the first ten seconds, every request in the remaining 50 seconds will be rate-limited because the quota is already exhausted.

We needed a solution that could process all incoming requests efficiently while maintaining fairness and preventing overload. Our approach uses a GenServer and queues to manage these requests.

GenServer is a powerful tool in Elixir for managing stateful processes. It lets us hold the request count as state and queue up new requests, so they are processed in an orderly manner without overwhelming the system. The timeout on a GenServer call ensures that a request that has waited too long is dropped gracefully rather than adding to an uncontrollable backlog that could exhaust system resources, which keeps the queue fair and balanced.
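To make the queueing and timeout behaviour concrete, here is a minimal sketch. It is not the Ping Pong code: `SpacedQueue`, `wait_turn/2`, the 100 ms interval and the 350 ms timeout are all hypothetical names and values chosen for illustration. It relies only on the fact that a GenServer handles one message at a time and that `GenServer.call/3` exits the caller when the timeout elapses.

```elixir
defmodule SpacedQueue do
  # Hypothetical stand-in for a rate limiter: each call is answered only after
  # `interval_ms` has passed, and calls are answered one at a time.
  use GenServer

  def start_link(interval_ms) do
    GenServer.start_link(__MODULE__, interval_ms)
  end

  # Blocks the caller until its turn comes up, or returns {:error, :timeout}
  # if the wait exceeds `timeout` milliseconds.
  def wait_turn(pid, timeout) do
    GenServer.call(pid, :wait_turn, timeout)
  catch
    :exit, {:timeout, _} -> {:error, :timeout}
  end

  @impl true
  def init(interval_ms), do: {:ok, interval_ms}

  @impl true
  def handle_call(:wait_turn, _from, interval_ms) do
    # Sleeping here delays every message queued behind this one in the mailbox.
    Process.sleep(interval_ms)
    {:reply, :ok, interval_ms}
  end
end

{:ok, pid} = SpacedQueue.start_link(100)

# Five concurrent callers, each willing to wait at most 350 ms.
results =
  1..5
  |> Enum.map(fn _ -> Task.async(fn -> SpacedQueue.wait_turn(pid, 350) end) end)
  |> Task.await_many()

# The earliest callers get :ok, the ones at the back of the queue time out.
IO.inspect(Enum.frequencies(results))
```

One thing worth keeping in mind with this pattern: a timeout only releases the caller. The message stays in the server's mailbox, so the server still spends an interval on it when its turn comes.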

We decided to run a separate rate limiter GenServer process for each of the inference and embedding functionalities because of their different priorities. For example, if an inference request comes in while an embedding request is being processed, it's essential to prioritize the more frequent requests first.

In our setup, we defined a limit of 400 requests per minute for inference and 200 for embedding, knowing that inference would typically have higher frequency. This method ensures that critical or high-priority tasks receive timely attention while less demanding operations can be processed with lower priority but still within acceptable limits.
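One way to read these per-minute limits is as a minimum spacing between consecutive requests. The sketch below works through that arithmetic; the `Spacing` module and `interval_ms/1` are hypothetical names used only for this illustration.

```elixir
defmodule Spacing do
  # Hypothetical helper: minimum gap in milliseconds between requests
  # for a given requests-per-minute limit.
  @window_ms 60_000

  # Integer division, so the result is rounded down to whole milliseconds.
  def interval_ms(requests_per_minute)
      when is_integer(requests_per_minute) and requests_per_minute > 0 do
    div(@window_ms, requests_per_minute)
  end
end

IO.inspect(Spacing.interval_ms(400), label: "inference")       # 150
IO.inspect(Spacing.interval_ms(200), label: "embedding")       # 300
IO.inspect(Spacing.interval_ms(600), label: "whole key quota") # 100
```

Spaced this way, inference requests go out at most every 150 ms and embedding requests at most every 300 ms, which together stay within the 600 requests per minute available on one key.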

When there are no requests in the queue, it's crucial to handle the situation gracefully without waiting indefinitely. In such cases, we might see bursts of 600 or more incoming requests that do not overlap with each other. To ensure smooth operation, retries are essential.

For these instances, ElixirRetry, an amazingly powerful library for handling retries, comes into play. It provides built-in retry logic with backoff strategies, so the system can handle transient issues without being left hanging or overloaded.
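To show the idea behind retrying with backoff without pulling in a dependency, here is a hand-rolled sketch in plain Elixir. It is not the ElixirRetry API: `BackoffRetry`, `run/3` and the simulated `flaky` call are hypothetical, and the library offers richer options than this.

```elixir
defmodule BackoffRetry do
  # Hypothetical helper, not the ElixirRetry API: calls `fun` until it returns
  # {:ok, value} or the attempts run out, doubling the delay after each failure.
  def run(fun, attempts \\ 5, delay_ms \\ 100)

  def run(fun, attempts, delay_ms) when attempts > 1 do
    case fun.() do
      {:ok, value} ->
        {:ok, value}

      {:error, _reason} ->
        Process.sleep(delay_ms)
        run(fun, attempts - 1, delay_ms * 2)
    end
  end

  # Last attempt: return whatever the function gives back.
  def run(fun, _attempts, _delay_ms), do: fun.()
end

# Simulated flaky call: fails twice with a rate-limit error, then succeeds.
{:ok, counter} = Agent.start_link(fn -> 0 end)

flaky = fn ->
  case Agent.get_and_update(counter, fn n -> {n + 1, n + 1} end) do
    n when n < 3 -> {:error, :rate_limited}
    n -> {:ok, "succeeded on attempt #{n}"}
  end
end

result = BackoffRetry.run(flaky, 5, 10)
IO.inspect(result)
```

Doubling the delay gives a struggling upstream service progressively more room to recover, and capping the number of attempts keeps a caller from retrying forever.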

This is where Ping Pong, the bespoke service I developed, comes into play: it balances load and ensures that no request is dropped because of rate limit constraints. Its source code is available at https://github.com/ahsandar/ping_pong.

The rate limiter GenServer code is shown below:


defmodule PingPong.RateLimiter do
  use GenServer
  require Logger

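  alias PingPong.Utility

  # Length of the rate limit window in milliseconds (one minute).
  @rate_limit_window 60_000
  # Counter value above which callers are made to wait.
  @safe_limit 1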

  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts, name: opts[:name])
  end

  @impl true
  def init(config) do
    # Reset the request counter when the limiter starts.
    Cachex.put(:ping_pong, Utility.cachex_counter(), 0)

    {:ok,
     %{
       name: config[:name],
       start_time: Utility.datetime_iso8601(),
       rate_limit: String.to_integer(config[:rate_limit] || "60")
     }}
  end

  @impl true
  def handle_call(:control_rate_limit, _from, state) do
    # Minimum spacing between requests in milliseconds, rounded down.
    sleep_time = div(@rate_limit_window, state.rate_limit)

    {_, count} = Cachex.get(:ping_pong, Utility.cachex_counter())
    Logger.info("Count since #{state.start_time}: #{count}")

    if count > @safe_limit do
      Logger.info(
        "Ensuring #{state.name} rate limit at #{Utility.datetime_iso8601()}, waiting for #{sleep_time}"
      )

      :timer.sleep(sleep_time)
    else
      Logger.info("Request count in safe zone")
    end

    {_, count} = Cachex.decr(:ping_pong, Utility.cachex_counter())
    Logger.info("Count since #{state.start_time}: #{count}")
    {:reply, :ok, state}
  end

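  # Blocks the caller until the limiter replies; the caller exits if
  # `timeout` (in milliseconds) elapses first.
  def queue(name, timeout \\ :infinity) do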
    GenServer.call(name, :control_rate_limit, timeout)
  end
end
