As an engineering leader at a growing startup, I've recently found myself in the thick of some scaling challenges. Our platform has been seeing increased traffic, pushing us to work through various performance issues. One of our biggest hurdles has been getting a handle on latency and timeout problems, particularly in our calls to different Large Language Models (LLMs). We needed a way to track trends over extended periods without adding complexity to our system. Traditional approaches to instrumenting metrics seemed daunting: they were time-consuming and required significant code changes, a luxury we couldn't afford in our fast-paced environment.
That's when I stumbled upon Google's Log-based Metrics (https://cloud.google.com/logging/docs/logs-based-metrics), a powerful feature within Google Cloud's operations suite that transforms your application logs into actionable metrics. It lets developers create custom metrics from existing log data without modifying code, offering easy implementation and flexibility. The tool enables retroactive analysis of historical log data and quick iteration on metrics, all while adding minimal overhead to application performance.
Why Log-based Metrics for LLM Monitoring?
Log-based Metrics are particularly useful for monitoring LLM calls because:
- They don't require code changes, allowing quick implementation in fast-paced environments.
- They can analyze historical data, helping identify long-term trends in LLM performance.
- They offer flexibility in metric creation, allowing us to track specific aspects of LLM calls like latency and error rates.
Types of Google Log-based Metrics
Google's Log-based Metrics come in two types: Count and Distribution. Count metrics record the number of log entries that match a filter; Distribution metrics extract numerical data from the matching entries. The following is an example of configuring a Log-based Metric.
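To make the distinction concrete, here is a minimal sketch, in plain Python with no GCP dependencies and invented log lines, of what each metric type computes from the same set of log entries: a Count metric tallies matching entries, while a Distribution metric extracts a numeric value from each one.

```python
import re

# Hypothetical log lines, in the format used later in this post.
logs = [
    "OpenAI API call complete in executionTimeMs: 3112 ms",
    "OpenAI API call complete in executionTimeMs: 1540 ms",
    "Health check OK",
    "OpenAI API call complete in executionTimeMs: 2875 ms",
]

# Count metric: how many entries match the filter.
count = sum(1 for line in logs if "OpenAI API call complete in" in line)

# Distribution metric: extract a numeric value from each matching entry.
pattern = re.compile(r"executionTimeMs: ([0-9]*)")
values = [int(m.group(1)) for line in logs if (m := pattern.search(line))]

print(count)   # 3
print(values)  # [3112, 1540, 2875]
```

The distribution values are what Cloud Monitoring buckets into histograms, which is what lets you later chart percentiles rather than just totals.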
Configure Distribution Log-based Metrics
Let's walk through an example of how to configure Distribution Log-based Metrics. In this example, I have a log message that outputs the amount of time it takes to call an LLM (in this case, OpenAI). The following is an example log message I am looking for:
"OpenAI API call complete in executionTimeMs: 3112 ms"
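For context, a log line like this can be produced by timing the call and logging the elapsed milliseconds. The snippet below is a rough sketch; `call_openai` and the message wording are stand-ins for our actual client code, not a real OpenAI SDK call.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def call_openai(prompt):
    # Stand-in for the real OpenAI client call.
    time.sleep(0.05)
    return "response"

def timed_call(prompt):
    start = time.monotonic()
    result = call_openai(prompt)
    elapsed_ms = int((time.monotonic() - start) * 1000)
    # Keep this wording stable: the log filter and regex depend on it.
    logger.info("OpenAI API call complete in executionTimeMs: %d ms", elapsed_ms)
    return result

timed_call("hello")
```

Because the metric is derived entirely from the log message, the only contract you have to maintain is the message format itself.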
To create a metric that tracks execution time, follow these steps:
a) Define a log filter to match the message pattern. The filter below looks for logs from a specific pod in our Kubernetes cluster running on GKE:
resource.type="k8s_container"
resource.labels.project_id="XXXX"
resource.labels.location="us-east1-c"
resource.labels.cluster_name="cluster_name"
resource.labels.namespace_name="env"
labels.k8s-pod/app="pod_name"
severity>=DEFAULT
"OpenAI API call complete in"
b) Specify the field name to extract the value from:
textPayload
c) Define the regular expression:
For the log message above, the following regular expression captures the execution time:
executionTimeMs: ([0-9]*)
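It's worth sanity-checking the regular expression against the sample log message before saving the metric. This quick sketch uses Python's re module with the exact pattern above:

```python
import re

message = "OpenAI API call complete in executionTimeMs: 3112 ms"
match = re.search(r"executionTimeMs: ([0-9]*)", message)

# Group 1 holds the value the distribution metric will record.
print(match.group(1))  # 3112
```

Note that the capture group must isolate just the numeric portion; if it matched "3112 ms" the metric extraction would fail.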
Once you have filled in the above values, save the metric and open Metrics Explorer. You can find your metric under the section unspecified -> Log-based metrics.
After a couple of minutes you should be able to see your metric and build a simple chart or table. The following is an example of the distribution of the time it takes to process an OpenAI call for us.
Limitations and Considerations
While Log-based Metrics are powerful, it's important to consider:
- They may introduce a slight delay in metric availability compared to real-time metrics.
- Complex log parsing can potentially impact performance if not optimized.
- They rely on consistent log formatting, so any changes to log structures need to be reflected in the metric configuration.
Google's Log-based Metrics offer a powerful, code-free solution for monitoring application performance, particularly useful for tracking LLM call latencies. We welcome your thoughts and experiences with this tool – have you found it helpful in your projects?
Top comments (1)
Thanks Shannon
We use a similar approach to analyze log-based metrics for our apps on Azure.