Code: https://github.com/SamparkAI/Composio-Function-Calling-Benchmark/
In the last blog, we introduced the ClickUp function calling benchmark and experimented with different optimisation approaches for improving function calling using gpt-4-turbo-preview.
This time, we wanted to check a selection of other models, which may or may not live up to their claims of superior performance. We also wanted to make our benchmark more generalised, to find which optimisation approaches suit which models for function calling.
Optimisation Techniques
As function calling is a new concept with little published literature, we surveyed experiments from the community. Based on these and our own intuition, we expected that techniques like flattening the schema structure, making system prompts more focused on function calls, improving function names, descriptions, and parameter descriptions, and adding examples would enhance function calling performance. So, we designed this elaborate experiment. The methods we tested:
- No System Prompt: only the problem statement.
- Flattening Schema: all hierarchical parameters are flattened into a shallow tree structure.
- Flattened Schema + Simple System Prompt: a simple system prompt is added, stating that function calling should be used.
- Flattened Schema + Focused System Prompt: the system prompt characterises the model's role in solving function calling problems.
- Flattened Schema + Focused System Prompt + Function Name Optimised: the function names are elaborated.
- Flattened Schema + Focused System Prompt + Function Description Optimised: the function descriptions are explained clearly.
- Flattened Schema + Focused System Prompt containing Schema summary: a summarised version of all function schemas is added to the system prompt.
- Flattened Schema + Focused System Prompt containing Schema summary + Function Name Optimised: summarised function schemas in the system prompt, with elaborated function names.
- Flattened Schema + Focused System Prompt containing Schema summary + Function Description Optimised: summarised function schemas in the system prompt, with clearly explained function descriptions.
- Flattened Schema + Focused System Prompt containing Schema summary + Function and Parameter Descriptions Optimised: additionally, the parameter descriptions are improved.
- Flattened Schema + Focused System Prompt containing Schema summary + Function and Parameter Descriptions Optimised + Function Call examples added: examples of function calls are added alongside the function descriptions.
- Flattened Schema + Focused System Prompt containing Schema summary + Function and Parameter Descriptions Optimised + Function Parameter examples added: examples of parameter values are added to the parameter descriptions.
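To make the first optimisation above concrete, here is a minimal sketch of what "flattening" a hierarchical schema can look like. The nested-schema shape and the dotted-key naming convention are our illustrative assumptions, not the benchmark's exact code:

```python
# Hypothetical sketch: collapse nested object parameters in a JSON-Schema-like
# function definition into a single flat property map.

def flatten_parameters(schema: dict, prefix: str = "") -> dict:
    """Flatten nested "object" parameters into dotted top-level keys."""
    flat = {}
    for name, spec in schema.get("properties", {}).items():
        key = f"{prefix}{name}"
        if spec.get("type") == "object" and "properties" in spec:
            # Recurse into the nested object instead of keeping the hierarchy.
            flat.update(flatten_parameters(spec, prefix=f"{key}."))
        else:
            flat[key] = spec
    return flat

nested = {
    "type": "object",
    "properties": {
        "task": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "priority": {"type": "integer"},
            },
        },
        "archived": {"type": "boolean"},
    },
}

print(flatten_parameters(nested))
# Keys become "task.name", "task.priority", "archived"
```

The idea is that a shallow parameter tree gives the model fewer nesting levels to get wrong when it constructs the call arguments.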
OpenAI Models
As we checked gpt-4-turbo-preview in the previous experiment, we wanted to test the performance of both its predecessor, gpt-4-0125-preview, and its successor, gpt-4-turbo. As we have seen before, even though next-generation models post impressive benchmark scores, they are often not better in an all-encompassing way. So, comparing against our previous scores, here is the performance of these two OpenAI models.
| Optimization Approach | gpt-4-turbo-preview | gpt-4-turbo | gpt-4-0125-preview |
| --- | --- | --- | --- |
| No System Prompt | 0.36 | 0.36 | 0.353 |
| Flattening Schema | 0.527 | 0.487 | 0.533 |
| Flattened Schema + Simple System Prompt | 0.553 | 0.533 | 0.54 |
| Flattened Schema + Focused System Prompt | 0.633 | 0.633 | 0.64 |
| Flattened Schema + Focused System Prompt + Function Name Optimized | 0.553 | 0.607 | 0.587 |
| Flattened Schema + Focused System Prompt + Function Description Optimized | 0.633 | 0.66 | 0.673 |
| Flattened Schema + Focused System Prompt containing Schema summary | 0.64 | 0.553 | 0.64 |
| Flattened Schema + Focused System Prompt containing Schema summary + Function Name Optimized | 0.70 | 0.707 | 0.686 |
| Flattened Schema + Focused System Prompt containing Schema summary + Function Description Optimized | 0.687 | 0.707 | 0.68 |
| Flattened Schema + Focused System Prompt containing Schema summary + Function and Parameter Descriptions Optimized | 0.767 | 0.767 | 0.787 |
| Flattened Schema + Focused System Prompt containing Schema summary + Function and Parameter Descriptions Optimized + Function Call examples added | 0.693 | 0.6 | 0.707 |
| Flattened Schema + Focused System Prompt containing Schema summary + Function and Parameter Descriptions Optimized + Function Parameter examples added | 0.787 | 0.693 | 0.787 |
So we can see that, in most cases, the original gpt-4-0125-preview performed better. When we added more examples of parameter values to the parameter descriptions, gpt-4-0125-preview consistently outperformed the other models. In the cases where we optimised or elaborated only the function names and descriptions, gpt-4-turbo seems to do better.
Anthropic Models
Next, we ran the same experiments with Anthropic's Claude 3 series of models. Claude 3 has three models, haiku, sonnet and opus, in increasing order of parameter count and, at least as expected, performance.
When we tried these models, we discovered that Claude models, especially opus, are very costly and very slow! Running the whole benchmark once with GPT-4 took ~4 minutes, while claude-3-opus-20240229 took ~13 minutes. claude-3-haiku-20240307 and claude-3-sonnet-20240229 took ~3 minutes and ~6 minutes, respectively.
We faced several problems while running the benchmark for Claude models. For example, unlike OpenAI models, Claude models usually precede their function/tool calls with a block of "thoughts" text, which required some changes to our benchmark code.
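The kind of change this required can be sketched as follows. The content-block shape below mimics the Anthropic Messages API (a list of "text" and "tool_use" blocks), but the response here is a hand-built stand-in rather than a real API reply:

```python
# Hypothetical sketch: skip past any leading "thoughts" text blocks and keep
# only the tool_use blocks when scoring a Claude response.

def extract_tool_calls(content_blocks: list) -> list:
    """Return only the tool_use blocks, ignoring surrounding text blocks."""
    return [b for b in content_blocks if b.get("type") == "tool_use"]

# Stand-in for the content list of a Claude Messages API response.
response_content = [
    {"type": "text", "text": "I should create the task with these fields..."},
    {"type": "tool_use", "name": "create_task", "input": {"name": "Demo"}},
]

calls = extract_tool_calls(response_content)
print(calls[0]["name"])  # create_task
```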
Then, while running it, we found that the scores were incredibly low in some cases, to the point of being absurd.
After some digging, we found that the models sometimes predicted boolean values as strings: True was predicted as "True" and False as "False". We added a fix for that and then finally obtained our results.
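The fix we describe above amounts to normalising predicted arguments before comparison. A minimal sketch of such a normaliser (a hypothetical helper, not the benchmark's actual code):

```python
# Hypothetical sketch: coerce string-typed booleans like "True"/"False" back
# to real booleans before comparing predicted arguments to the ground truth.

def normalize_booleans(arguments: dict) -> dict:
    fixed = {}
    for key, value in arguments.items():
        if isinstance(value, str) and value.lower() in ("true", "false"):
            fixed[key] = value.lower() == "true"
        else:
            fixed[key] = value
    return fixed

print(normalize_booleans({"archived": "True", "name": "Demo"}))
# {'archived': True, 'name': 'Demo'}
```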
| Optimization Approach | claude-3-haiku-20240307 | claude-3-sonnet-20240229 | claude-3-opus-20240229 |
| --- | --- | --- | --- |
| No System Prompt | 0.48 | 0.6 | 0.42 |
| Flattening Schema | 0.5 | 0.58 | 0.5 |
| Flattened Schema + Simple System Prompt | 0.54 | 0.6 | 0.54 |
| Flattened Schema + Focused System Prompt | 0.54 | 0.54 | 0.54 |
| Flattened Schema + Focused System Prompt + Function Name Optimized | 0.52 | 0.62 | 0.52 |
| Flattened Schema + Focused System Prompt + Function Description Optimized | 0.52 | 0.6 | 0.52 |
| Flattened Schema + Focused System Prompt containing Schema summary | 0.46 | 0.62 | 0.46 |
| Flattened Schema + Focused System Prompt containing Schema summary + Function Name Optimized | 0.5 | 0.64 | 0.46 |
| Flattened Schema + Focused System Prompt containing Schema summary + Function Description Optimized | 0.5 | 0.6 | 0.6 |
| Flattened Schema + Focused System Prompt containing Schema summary + Function and Parameter Descriptions Optimized | 0.58 | 0.74 | 0.58 |
| Flattened Schema + Focused System Prompt containing Schema summary + Function and Parameter Descriptions Optimized + Function Call examples added | 0.6 | 0.76 | 0.64 |
| Flattened Schema + Focused System Prompt containing Schema summary + Function and Parameter Descriptions Optimized + Function Parameter examples added | 0.68 | 0.76 | 0.66 |
Now, I know what you're thinking: we must have mixed up the haiku and opus scores. But believe me, I am equally surprised, and I can assure you that we ran the opus benchmark multiple times and checked the code thoroughly for bugs.
opus, sonnet and haiku initially outperform the GPT models in the non-optimised scenarios. sonnet consistently outpaces haiku, as expected. Had opus maintained this trend, it would likely have surpassed the OpenAI models.
Finally
OpenAI models, especially gpt-4-turbo-preview, are still the better choice in terms of both performance and cost.
| Optimization Approach | gpt-4-turbo-preview | gpt-4-turbo | gpt-4-0125-preview | claude-3-haiku-20240307 | claude-3-sonnet-20240229 | claude-3-opus-20240229 |
| --- | --- | --- | --- | --- | --- | --- |
| No System Prompt | 0.36 | 0.36 | 0.353 | 0.48 | 0.6 | 0.42 |
| Flattening Schema | 0.527 | 0.487 | 0.533 | 0.5 | 0.58 | 0.5 |
| Flattened Schema + Simple System Prompt | 0.553 | 0.533 | 0.54 | 0.54 | 0.6 | 0.54 |
| Flattened Schema + Focused System Prompt | 0.633 | 0.633 | 0.64 | 0.54 | 0.54 | 0.54 |
| Flattened Schema + Focused System Prompt + Function Name Optimized | 0.553 | 0.607 | 0.587 | 0.52 | 0.62 | 0.52 |
| Flattened Schema + Focused System Prompt + Function Description Optimized | 0.633 | 0.66 | 0.673 | 0.52 | 0.6 | 0.52 |
| Flattened Schema + Focused System Prompt containing Schema summary | 0.64 | 0.553 | 0.64 | 0.46 | 0.62 | 0.46 |
| Flattened Schema + Focused System Prompt containing Schema summary + Function Name Optimized | 0.70 | 0.707 | 0.686 | 0.5 | 0.64 | 0.46 |
| Flattened Schema + Focused System Prompt containing Schema summary + Function Description Optimized | 0.687 | 0.707 | 0.68 | 0.5 | 0.6 | 0.6 |
| Flattened Schema + Focused System Prompt containing Schema summary + Function and Parameter Descriptions Optimized | 0.767 | 0.767 | 0.787 | 0.58 | 0.74 | 0.58 |
| Flattened Schema + Focused System Prompt containing Schema summary + Function and Parameter Descriptions Optimized + Function Call examples added | 0.693 | 0.6 | 0.707 | 0.6 | 0.76 | 0.64 |
| Flattened Schema + Focused System Prompt containing Schema summary + Function and Parameter Descriptions Optimized + Function Parameter examples added | 0.787 | 0.693 | 0.787 | 0.68 | 0.76 | 0.66 |
All the code is organised at: https://github.com/SamparkAI/Composio-Function-Calling-Benchmark/.
We're currently deciding which models to test next, perhaps Mistral or open-source options like Functionary or NexusRaven. Check out our repository and try running these models to compare their performance. If you have questions or suggestions, please submit a pull request. Thank you!