Approaching human-level performance
In the previous article, I talked about the benchmark performance of LLMs on Text-to-SQL tasks.
At the time of writing, there is still a large gap between the performance of LLM-based solutions and baseline human-level performance (HLP).
We acknowledged this problem, but also discussed that benchmark performance is not conceptually the same thing as performance on a specific business case.
For example, the BIRD benchmark has demonstrated that adding external knowledge vastly improves a model’s performance. Here, we speculate that it might be possible to get much closer to HLP if we think of HLP as a typical data analyst in a specific company.
Even if the AI solution is not as ‘intelligent’ as a human data analyst, it may outperform a human at other things, and those can help bridge the gap between model performance and HLP.
One of these things is knowledge base examples (external knowledge): a collection of ground-truth queries for pre-defined questions.
If this collection is large and the queries are complex, a human cannot realistically remember them all and reuse them effectively.
In contrast, similarity search integrated into an LLM-based solution, also known as Retrieval-Augmented Generation (RAG), can retrieve relevant knowledge base examples quickly and effectively.
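As an illustration, here is a minimal sketch of such a retrieval step over a small knowledge base of question/SQL pairs. The knowledge base contents, the `embed()` bag-of-words stand-in, and the `retrieve_examples()` helper are all hypothetical; a real solution would use a proper sentence-embedding model and a vector store.

```python
import numpy as np

# Hypothetical knowledge base: pre-defined questions with their ground-truth SQL.
KNOWLEDGE_BASE = [
    {"question": "What is the monthly revenue?",
     "sql": "SELECT date_trunc('month', created_at) AS month, SUM(amount) FROM orders GROUP BY 1;"},
    {"question": "What is the monthly customer growth?",
     "sql": "SELECT date_trunc('month', signup_date) AS month, COUNT(*) FROM customers GROUP BY 1;"},
]

def embed(text: str) -> np.ndarray:
    """Toy bag-of-words embedding used as a stand-in for a real sentence-embedding model."""
    vec = np.zeros(512)
    for token in text.lower().split():
        vec[hash(token) % 512] += 1.0
    return vec

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def retrieve_examples(question: str, top_k: int = 3) -> list[dict]:
    """Return the top_k knowledge base examples most similar to the question."""
    q_vec = embed(question)
    ranked = sorted(KNOWLEDGE_BASE,
                    key=lambda ex: cosine(q_vec, embed(ex["question"])),
                    reverse=True)
    return ranked[:top_k]
```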
Steps not from general knowledge, but from business knowledge
Let’s say we have an LLM agent that is capable of breaking down complex input questions into multiple simple steps, with a separate query generated for each step.
This multi-step approach has been demonstrated to perform better on the benchmarks than the baseline with the same LLM (BIRD). Here, we go one step further and propose breaking down the input question guided by the most similar examples.
Instead of first creating a series of steps from the LLM’s general understanding and then retrieving the most similar knowledge base example for each step, we generate the steps themselves so that they are as close to the available examples as possible.
For example, assume we have examples for ‘customer growth’ and ‘monthly revenue’, and the question is ‘How is the revenue generated related to the increase in customers?’.
With the proposed approach, the model considers the available examples at the question-breakdown step and generates sub-questions that are as close to those examples as possible.
This encourages the model to reuse the available examples as much as possible, in this case the examples for ‘customer growth’ and ‘monthly revenue’.
The closer the provided example is to the desired question, the higher the chance that the generated query will be accurate.
The proposed solution uses RAG twice: once for breaking down the question, and once for generating each sub-query (Fig. 1).
Fig. 1. Proposed workflow for breaking down questions into simple steps using RAG based on available business knowledge examples.
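To make the workflow in Fig. 1 concrete, here is a rough sketch of the two retrieval passes, reusing `retrieve_examples()` from the earlier snippet. The `llm()` function is only a stand-in for whichever model powers the agent, and the prompt wording is purely illustrative.

```python
def llm(prompt: str) -> str:
    """Stand-in for a call to whichever LLM powers the agent."""
    raise NotImplementedError

def break_down_question(question: str) -> list[str]:
    """First RAG pass: split the question into sub-questions that stay close to known examples."""
    examples = retrieve_examples(question, top_k=5)
    known = "\n".join(f"- {ex['question']}" for ex in examples)
    prompt = (
        "Break the question into simple sub-questions, one per line. "
        "Phrase each sub-question as close as possible to one of these known questions:\n"
        f"{known}\n\nQuestion: {question}"
    )
    return [line.strip("- ").strip() for line in llm(prompt).splitlines() if line.strip()]

def generate_sub_query(sub_question: str) -> str:
    """Second RAG pass: generate SQL for one sub-question from its closest examples."""
    examples = retrieve_examples(sub_question, top_k=3)
    shots = "\n\n".join(f"Q: {ex['question']}\nSQL: {ex['sql']}" for ex in examples)
    prompt = f"Ground-truth examples:\n{shots}\n\nWrite a SQL query answering: {sub_question}"
    return llm(prompt)

def answer(question: str) -> list[str]:
    """Full pipeline: RAG for the breakdown, then RAG again for every sub-query."""
    return [generate_sub_query(sq) for sq in break_down_question(question)]
```

For the question above, `answer('How is the revenue generated related to the increase in customers?')` would ideally produce sub-questions matching the ‘monthly revenue’ and ‘customer growth’ examples, so their ground-truth queries can be reused almost directly.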
Unlimited examples at a lower cost
The first step of the proposed solution uses an LLM agent to break down business questions into sub-questions, guided by the available ground-truth examples from the knowledge base.
One may argue that RAG is not necessary at this step, as we could simply place all available examples in the prompt.
This would work for a small knowledge base, but as the knowledge base grows, it becomes harder to fit all available examples into a single prompt.
Even though much larger context windows are now available, stuffing the prompt would not be optimal from a cost perspective. By using RAG with a limited number of similar examples, we can achieve both better performance and reasonable cost savings.
We further speculate that this approach would be most effective with extremely large databases and a large number of ground-truth examples.
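As a rough illustration of the cost argument, the snippet below compares a prompt stuffed with every knowledge base example against one that only contains the retrieved top-k. It reuses `KNOWLEDGE_BASE` and `retrieve_examples()` from the earlier sketch, and the ~4-characters-per-token figure is a crude approximation, not a real tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate (~4 characters per token), for illustration only."""
    return len(text) // 4

question = "How is the revenue generated related to the increase in customers?"

# Naive approach: every knowledge base example goes into the prompt.
all_examples = "\n\n".join(f"Q: {ex['question']}\nSQL: {ex['sql']}" for ex in KNOWLEDGE_BASE)

# RAG approach: only the few most similar examples are included.
top_examples = "\n\n".join(f"Q: {ex['question']}\nSQL: {ex['sql']}"
                           for ex in retrieve_examples(question, top_k=3))

print("all examples:", estimate_tokens(all_examples), "tokens")
print("top-k only:  ", estimate_tokens(top_examples), "tokens")
```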
We need a better benchmark
Breaking down complex business questions into simple steps and adding ground-truth business knowledge both greatly improve the performance of LLM-powered Text-to-SQL solutions.
The approach proposed in this article combines the two ideas by breaking down questions into the steps most similar to the available knowledge base examples.
This could improve the performance of such solutions on actual real-world problems.
As of now, it is unclear how to evaluate the performance of the proposed solution; to my knowledge, there is no benchmark suitable for testing it effectively. We therefore leave the evaluation of this method for future research.