Trained with causal language modeling, Large Language Models (LLMs) are adept at a broad spectrum of tasks, yet they often falter in fundamental areas such as logic, calculation, and search. A particularly awkward situation arises when an LLM performs poorly in a domain such as mathematics, yet still insists on handling all of the related computation itself.
To address this shortcoming, one effective strategy involves embedding the LLM within a framework that enables it to utilize tools. This type of framework is known as an LLM agent.
Let's delve into the mechanics of ReAct agents. I'll demonstrate how to construct them using the newly integrated ChatHuggingFace class in LangChain, and I'll conclude by benchmarking several open-source LLMs against GPT-3.5 and GPT-4.
So, what would a complete Agent setup look like?
To illustrate a complete Agent setup, let's delve into an example where an LLM agent is tasked with a specific question, requiring the integration of various tools and observations.
First, we initialize the environment by presenting the LLM agent with the initial question and a suite of tools it can utilize. For instance, if the question is, "What is the weather forecast for Paris tomorrow?", the agent has access to a weather forecasting tool, among others.
The agent then begins processing the question, contemplating the necessary steps to find the answer. It might think, "To answer this, I need the latest weather data for Paris."
The agent's first action would likely be to call the weather forecasting tool. The call might look like this:
Action:
```json
{
    "action": "get_weather",
    "action_input": {
        "location": "Paris",
        "date": "tomorrow"
    }
}
```
Once the tool is called, it returns the weather forecast data, which is then appended to the agent's prompt as an observation. For example, the observation might be: {'forecast': 'Partly cloudy, 18°C'}.
The agent now updates its internal state with this new information and re-evaluates the situation. The updated prompt, including the observation, is processed, and the agent determines if it has sufficient information to answer the original question. The LLM is engaged again with this enriched prompt.
If the information is adequate, the agent then formulates a final answer, prefaced with 'Final Answer:', like so:
```
Final Answer: The weather in Paris tomorrow is expected to be partly cloudy with a temperature of 18°C.
```
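To make this cycle concrete, here is a minimal, framework-free sketch of the ReAct loop in Python. Everything in it is a hypothetical placeholder rather than any particular library's API: `get_weather` is a stubbed tool, and `call_llm` simply replays a canned two-step trace in place of a real model call.

```python
import json

# Hypothetical tool registry; get_weather stands in for a real forecasting API.
def get_weather(location: str, date: str) -> dict:
    return {"forecast": "Partly cloudy, 18°C"}

TOOLS = {"get_weather": get_weather}

_CANNED_OUTPUTS = [
    'Thought: I need the latest weather data for Paris.\n'
    'Action: {"action": "get_weather", "action_input": {"location": "Paris", "date": "tomorrow"}}',
    "Final Answer: The weather in Paris tomorrow is expected to be partly cloudy with a temperature of 18°C.",
]

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call; here it just replays a canned two-step trace."""
    return _CANNED_OUTPUTS.pop(0)

def react_loop(question: str, max_steps: int = 5) -> str:
    prompt = f"Question: {question}\n"
    for _ in range(max_steps):
        output = call_llm(prompt)
        if "Final Answer:" in output:
            return output.split("Final Answer:", 1)[1].strip()
        # Parse the JSON tool call that follows "Action:" and execute the chosen tool
        action = json.loads(output.split("Action:", 1)[1].strip())
        observation = TOOLS[action["action"]](**action["action_input"])
        # Append the thought, action, and observation so the next call sees the full trace
        prompt += f"{output}\nObservation: {observation}\n"
    return "No final answer within the step budget."

print(react_loop("What is the weather forecast for Paris tomorrow?"))
```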
The challenges in such an Agent setup are multifaceted:
- Tool Selection: The agent must accurately determine which tool or tools are necessary to solve the given problem, avoiding irrelevant or unhelpful tool calls.
- Argument Formatting: When calling tools, the agent must format its requests correctly. This includes using the right tool names, providing the necessary argument values (not names), and adhering to any specific syntax or format required by the tool.
- Information Integration: The agent must effectively incorporate the observations from previous tool uses, along with the initial context, to build towards the final answer. This requires a nuanced understanding of how each piece of information contributes to the overall task.
In essence, a complete Agent setup involves a harmonious interplay between asking the right questions, calling the appropriate tools with precision, and synthesizing all gathered information to reach a coherent and accurate conclusion. This setup, while complex, opens up vast possibilities for LLM agents in solving diverse and intricate problems.
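One way to soften the first two challenges is to validate each parsed tool call before executing it. The helper below is a hypothetical sketch (reusing the TOOLS registry from the loop above), not something any framework provides out of the box:

```python
import inspect

def validate_tool_call(action: dict, tools: dict) -> list[str]:
    """Return a list of problems with a parsed tool call; an empty list means it looks valid."""
    problems = []
    name = action.get("action")
    if name not in tools:
        return [f"Unknown tool '{name}'; available tools: {sorted(tools)}"]
    expected = set(inspect.signature(tools[name]).parameters)
    provided = set(action.get("action_input", {}))
    if missing := expected - provided:
        problems.append(f"Missing arguments: {sorted(missing)}")
    if unexpected := provided - expected:
        problems.append(f"Unexpected arguments: {sorted(unexpected)}")
    return problems

# Example: the agent asked for a tool that does not exist
print(validate_tool_call({"action": "get_forecast", "action_input": {"city": "Paris"}}, TOOLS))
```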
Implementing Agents with LangChain
The ChatHuggingFace wrapper was recently integrated into LangChain, enabling the creation of agents from open-source models.
The process to set up the ChatModel and equip it with tools is straightforward, as detailed in the Langchain documentation.
```python
from langchain_community.llms import HuggingFaceHub
from langchain_community.chat_models.huggingface import ChatHuggingFace

llm = HuggingFaceHub(
    repo_id="HuggingFaceH4/zephyr-7b-beta",
    task="text-generation",
)

chat_model = ChatHuggingFace(llm=llm)
```
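As a quick sanity check, the wrapped model can be queried directly (this assumes a valid HUGGINGFACEHUB_API_TOKEN is available in the environment):

```python
from langchain.schema import HumanMessage

# Requires a Hugging Face API token in the HUGGINGFACEHUB_API_TOKEN environment variable.
response = chat_model.invoke([HumanMessage(content="Briefly explain what a ReAct agent is.")])
print(response.content)
```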
Transforming the chat_model into an agent involves providing a ReAct style prompt and relevant tools:
```python
from langchain import hub
from langchain.agents import AgentExecutor, load_tools
from langchain.agents.format_scratchpad import format_log_to_str
from langchain.agents.output_parsers import (
    ReActJsonSingleInputOutputParser,
)
from langchain.tools.render import render_text_description
from langchain_community.utilities import SerpAPIWrapper

# Initialize the tools: SerpAPI internet search and an LLM-backed calculator
tools = load_tools(["serpapi", "llm-math"], llm=llm)

# Set up a ReAct-style prompt from the LangChain Hub
prompt = hub.pull("hwchase17/react-json")
prompt = prompt.partial(
    tools=render_text_description(tools),
    tool_names=", ".join([t.name for t in tools]),
)

# Configure the agent; the stop sequence keeps the model from hallucinating observations
chat_model_with_stop = chat_model.bind(stop=["\nObservation"])
agent = (
    {
        "input": lambda x: x["input"],
        "agent_scratchpad": lambda x: format_log_to_str(x["intermediate_steps"]),
    }
    | prompt
    | chat_model_with_stop
    | ReActJsonSingleInputOutputParser()
)

# Create the AgentExecutor and run it on a sample question
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)
agent_executor.invoke(
    {
        "input": "Who is the current holder of the speed skating world record on 500 meters? What is her current age raised to the 0.43 power?"
    }
)
```
The agent processes the input, performing necessary searches and calculations:
```
Thought: Identify the age of the current speedskating world record holder using the search tool.
Action:
{
    "action": "search",
    "action_input": "speed skating world record holder 500m age"
}
Observation: ...
```
Agents Showdown: Evaluating Open-Source LLMs as Reasoning Agents
Evaluation Methodology
We assess the performance of open-source LLMs as general-purpose reasoning agents by testing their logic and basic tool use (calculator and internet search). Our evaluation dataset merges samples from three sources:
- HotpotQA for internet search capability: originally a retrieval dataset, it serves here for general question answering with internet access. Some questions require aggregating information from multiple sources, meaning several internet search steps in our context.
- GSM8K for calculator usage: testing grade-school math skills solvable by basic arithmetic operations.
- GAIA for diverse tool requirements: from this challenging General AI Assistants benchmark, we selected questions solvable with just search and calculator tools.
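To make that composition concrete, a mixed evaluation set along these lines could be assembled with the datasets library. This is only a sketch: the dataset IDs, configs, splits, and sample counts are assumptions, and the gated GAIA dataset is omitted here.

```python
from datasets import load_dataset

# Sketch only: dataset IDs, configs, splits, and sample counts are assumptions.
# GAIA (gaia-benchmark/GAIA) is gated on the Hub and is left out of this sketch.
hotpot = load_dataset("hotpot_qa", "distractor", split="validation").shuffle(seed=0).select(range(30))
gsm8k = load_dataset("gsm8k", "main", split="test").shuffle(seed=0).select(range(30))

eval_set = [
    {"source": "hotpotqa", "question": row["question"], "reference": row["answer"]} for row in hotpot
] + [
    {"source": "gsm8k", "question": row["question"], "reference": row["answer"]} for row in gsm8k
]
```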
GPT-4, serving as a judge, evaluates these using a Prometheus prompt format, rating on a 5-point Likert Scale.
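A minimal sketch of that judging step might look as follows, using the OpenAI Python client; the rubric below is a simplified stand-in for the actual Prometheus prompt format, not the exact prompt used.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Simplified stand-in for the Prometheus-style rubric.
JUDGE_PROMPT = """You are a strict evaluator. Given a question, a reference answer, and a
candidate answer, rate the candidate on a 1-5 Likert scale (5 = fully correct and complete).

Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}

Reply with a single integer between 1 and 5."""

def judge(question: str, reference: str, candidate: str) -> int:
    completion = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, reference=reference, candidate=candidate),
        }],
    )
    return int(completion.choices[0].message.content.strip())
```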
Models in the Test
We evaluate several leading open-source models:
- Llama2-70b-chat
- Mixtral-8x7B-Instruct-v0.1
- OpenHermes-2.5-Mistral-7B
- Zephyr-7b-beta
- SOLAR-10.7B-Instruct-v1.0
These models are tested in LangChain's ReAct framework, prompting them to structure their function calls as follows:
```
{
    "action": $TOOL_NAME,
    "action_input": $INPUT
}
```
For perspective, GPT-3.5 and GPT-4 are also evaluated using LangChain's OpenAI-specific agent, optimized for their function-calling format.
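For reference, such an OpenAI function-calling agent can be assembled roughly as below. This is a sketch rather than the exact evaluation harness, and it reuses the tools list defined earlier; the prompt ID and model name are assumptions.

```python
from langchain import hub
from langchain.agents import AgentExecutor, create_openai_functions_agent
from langchain_openai import ChatOpenAI

# Assumes OPENAI_API_KEY is set and `tools` is the list defined in the earlier snippet.
openai_llm = ChatOpenAI(model="gpt-4", temperature=0)
openai_prompt = hub.pull("hwchase17/openai-functions-agent")
openai_agent = create_openai_functions_agent(openai_llm, tools, openai_prompt)
openai_executor = AgentExecutor(agent=openai_agent, tools=tools, verbose=True)
```

Because these models are fine-tuned for their native function-calling format, they do not rely on the JSON-formatting instructions that the open-source models must follow.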
Results and Insights
The open-source models, not specifically tuned for the output format, faced a minor disadvantage compared to OpenAI models.
Nevertheless, some models showed impressive results. For instance, Mixtral-8x7B outperformed GPT-3.5, especially noteworthy considering it wasn't fine-tuned for agent workflows. Challenges included improper formatting of tool calls in some instances.
Here is a benchmark of the models on the evaluation dataset (the average scores, originally on a scale of 1-5, have been converted to a 0-100% scale for readability):
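The rescaling is presumably the simple linear map sketched below (an assumption about the exact formula used):

```python
def likert_to_percent(avg_score: float) -> float:
    """Map an average score on the 1-5 Likert scale linearly onto 0-100%."""
    return (avg_score - 1) / 4 * 100

print(likert_to_percent(5.0))  # 100.0
print(likert_to_percent(3.0))  # 50.0
```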
I encourage open-source developers to fine-tune Mixtral for agent tasks, aiming to surpass GPT-4.
Concluding Observations:
- The GAIA benchmark, despite being a subset test with limited tools, appears to be a strong indicator of model performance in agent workflows.
- Agent workflows enhance LLM performance. For instance, GPT-4's performance on GSM8K improved from 92% (5-shot CoT) to 95% with the addition of a calculator. Similarly, Mixtral-8x7B jumped from 57.6% (5-shot) to 73% in zero-shot with the same enhancement. This analysis suggests that fine-tuning and tool integration are key for advancing LLM capabilities in agent frameworks.