Did you know that across all the LEGO Star Wars sets, there are 9113 unique pieces used?
If you were to build every single one of those Star Wars sets, you would have used 63,269 pieces in the "Light Bluish Gray" color (#A0A5A9).
Occasionally, I stumble across unique and interesting data sets that I want to explore. Usually I end up throwing them in a SQL database, poking around the schema, and running some SQL queries here and there. Sometimes I can learn a lot just grepping my way through the files for keywords and analyzing those results.
Given that I like playing with data, LLMs and RAG (retrieval-augmented generation) seem inescapable these days. Using the power of an LLM to ask more in-depth questions across a vectorized space has yielded some surprising results, but I struggle to trust it beyond basic discovery and toy use cases. Don't get me wrong, I do use these tools where I find they aid my process, but I am also cautious when adding new tools to any kind of feature or process, especially ones that can "hallucinate" (read: "lie to your face and not be at all sorry").
L(EGO)LLM
I recently found that the LEGO data from Rebrickable is downloadable in bulk, and it contains all sorts of data about LEGO sets, pieces, minifigures, and more.
There is a lot of CSV data here. Millions of records. And given most of it is just numbers and relationships, RAG seemed like the wrong tool. We aren't trying to search or extract semantic meaning from things like titles, comments, or descriptions, and there isn't much text to feed an embedding strategy to get vectors. In this case, we are trying to draw meaning from lots of relationships defined with joins on ids and aggregations. However, not all is lost. There's been a flurry of activity with Agents and Agentic tools in the LLM space: basically, giving an LLM the ability to do things like create and run code, along with some degree of "autonomy" to figure out issues.
I've been playing with a new tool called PromptQL from Hasura that leverages what they call "Agentic Data Access", which puts an interesting twist on things. (Note: I was given early access, was working with the team to provide feedback, and they sponsored some of my time to play with the tool.) Building on the standardized data access layer that Hasura already enables, PromptQL is a data access agent that can understand your data schema, then generate a query plan to access your data and perform computations on the fly. PromptQL isn't just RAG across a vector space. It queries data straight out of things like Postgres, MongoDB, or even totally non-SQL sources like GitHub APIs. The ability to make an optimal query plan, as opposed to following a strict retrieval pipeline, is what makes PromptQL truly powerful.
Back to our LEGO data. The schema is pretty simple, just a lot of data organized into meaningful relationships.
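If you've looked at the standard Rebrickable dump before, the shape will be roughly familiar: sets carry a theme_id pointing at themes (which nest via parent_id), each set has one or more inventories, inventory_parts rows tie an inventory to a part_num and color_id with a quantity, and lookup tables like colors, parts, part_categories, and minifigs fill in the rest.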
I wrote a simple script to download everything from the Rebrickable website and dump it into a Postgres database. Should be fairly simple to grab and run yourself if you want to do your own tinkering.
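In case you want a starting point, here's a minimal sketch of that kind of loader. It assumes the dumps still live under cdn.rebrickable.com/media/downloads/ as gzipped CSVs, that matching tables already exist in your database, and that psql is on your path; adjust table names and connection details to taste.

import gzip
import shutil
import subprocess
import urllib.request

# Core tables from the Rebrickable bulk downloads; trim or extend as needed
TABLES = [
    "themes", "colors", "part_categories", "parts", "elements",
    "sets", "minifigs", "inventories", "inventory_parts",
    "inventory_sets", "inventory_minifigs",
]
BASE_URL = "https://cdn.rebrickable.com/media/downloads/"
DB_URL = "postgres://localhost:5432/lego"  # adjust to your setup

for table in TABLES:
    # Stream-download and decompress each gzipped CSV dump
    with urllib.request.urlopen(f"{BASE_URL}{table}.csv.gz") as resp:
        with gzip.open(resp, "rb") as gz, open(f"{table}.csv", "wb") as out:
            shutil.copyfileobj(gz, out)

    # Bulk-load into Postgres via COPY (the dumps include a header row)
    subprocess.run(
        ["psql", DB_URL, "-c",
         f"\\copy {table} FROM '{table}.csv' WITH (FORMAT csv, HEADER true)"],
        check=True,
    )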
It's all about the Prompt(QL)
Once I had that done, I set up PromptQL to point to the database. The PromptQL docs are pretty good at explaining this.
Once done, I fired up the PromptQL interface, and if you have used any chat interface (or ChatGPT-like app), you'll feel right at home.
For example, I wonder how many of those Star Wars sets have that Light Bluish Gray piece?
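Under the hood, answering that boils down to a join walk from sets to their inventories to the parts in those inventories. Here's a sketch of the kind of SQL involved, assuming the same app.* table naming you'll see below and the standard Rebrickable relationships (a complete answer would also fold in Star Wars sub-themes via themes.parent_id):

# Sketch: count Star Wars sets containing at least one
# Light Bluish Gray part. Table names assume the app.* convention
# PromptQL uses for this database.
sql = """
SELECT COUNT(DISTINCT s.set_num) AS sw_set_count
FROM app.Sets s
JOIN app.Themes t ON t.id = s.theme_id
JOIN app.Inventories i ON i.set_num = s.set_num
JOIN app.Inventory_Parts ip ON ip.inventory_id = i.id
JOIN app.Colors c ON c.id = ip.color_id
WHERE t.name = 'Star Wars'
  AND c.name = 'Light Bluish Gray'
"""
result = executor.run_sql(sql)
executor.print(f"Star Wars sets using Light Bluish Gray: {result[0]['sw_set_count']}")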
So how do I know I can even trust these results? Simple: I don't have to. Every action PromptQL's agentic workflow takes is backed by code that it shows me.
The question "How many different colors are there?" replies with the following:
To answer your question about the number of different colors, I'll need to query the Colors table in the database. Let me fetch that information for you.
Query Plan
- Query the Colors table to count the number of distinct colors.
- Retrieve the total count of colors.
- Store the result in an artifact for easy reference.
- Present the information to the user.
It then writes and executes the following code:
# Query to count the number of distinct colors
sql = """
SELECT COUNT(DISTINCT id) as color_count
FROM app.Colors
"""

# Execute the query
result = executor.run_sql(sql)

# Extract the color count
color_count = result[0]['color_count']

# Store the result in an artifact
executor.store_artifact(
    'color_count',
    'Total Number of LEGO Colors',
    'table',
    [{'Total Colors': color_count}]
)

# Print the result for verification
executor.print(f"Total number of colors: {color_count}")
Output:
SQL statement returned 1 rows.
Stored table artifact: identifier = 'color_count', title = 'Total Number of LEGO Colors', number of rows = 1, sample rows = [{'Total Colors': 267}]
Total number of colors: 267
I don't have to trust whatever it spins out of some semantic search and RAG. I can look at the actual SQL it put together, and go look at the actual data sources if I need to confirm things. I have a habit of checking the executed code blocks fairly often to sanity check things. This level of transparency is, in my opinion, critical for debugging Agentic applications.
One of my favorite things to do: because it has knowledge of your schema, you can ask PromptQL what kinds of questions you should even be asking. For the LEGO data I have, it suggested trying the following:
- Historical evolution of LEGO sets, tracking how piece counts and complexity have changed across decades
- Analysis of rare minifigures and unique parts that only appear in a single set
- Comparison between licensed themes (Star Wars, Harry Potter) and original LEGO themes in terms of popularity and complexity
- Distribution patterns of colors across sets, particularly looking at sets with the most diverse color palettes
- Deep dive into the most epic sets ever produced, ranked by piece count and build complexity
I actually tried the first question, and PromptQL yielded some interesting bullet points (I'll sketch the underlying aggregation after the list):
- Piece count explosion: From just 18.5 pieces per set (1950s) to 363.3 pieces (2020s) - a massive 1,863.8% increase
- Color variety: Sets went from using 2.7 colors (1950s) to 13.9 colors (2020s) - a 414.8% increase
- Production peak: The 2010s saw the highest number of set releases with 6,074 different sets
- Temporary decline: The 1980s showed an unusual dip in both piece count and color complexity compared to the 1970s
- Recent acceleration: The most dramatic increases in both metrics occurred in the 2010s and 2020s
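For the curious, that decade roll-up is exactly the kind of aggregation PromptQL writes on your behalf. A sketch, with the same app.* naming assumptions as before:

# Sketch: average piece count and color variety per decade.
# Colors-per-set is computed per set first, then averaged per decade.
sql = """
WITH set_colors AS (
    SELECT s.set_num,
           s.year,
           s.num_parts,
           COUNT(DISTINCT ip.color_id) AS color_count
    FROM app.Sets s
    JOIN app.Inventories i ON i.set_num = s.set_num
    JOIN app.Inventory_Parts ip ON ip.inventory_id = i.id
    GROUP BY s.set_num, s.year, s.num_parts
)
SELECT (year / 10) * 10 AS decade,
       COUNT(*) AS sets_released,
       ROUND(AVG(num_parts), 1) AS avg_pieces,
       ROUND(AVG(color_count), 1) AS avg_colors
FROM set_colors
GROUP BY decade
ORDER BY decade
"""
result = executor.run_sql(sql)
executor.store_artifact('decade_trends', 'LEGO Set Complexity by Decade', 'table', result)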
You also aren't strictly limited to questions answerable directly from the data. Take GitHub issues, for instance. You can still leverage a typical LLM use case and ask PromptQL to categorize things by, say, the issue content: sort them into various priorities (high, medium, low) or different types of issue (support, bug, feature). PromptQL will then combine pulling records from your sources with using an LLM to classify them, and use those results in your output artifacts or additional steps.
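A sketch of what that flow might look like. The executor.classify call here is a hypothetical stand-in for PromptQL's LLM-backed classification step (the real primitive's name and signature are in the PromptQL docs), and app.Issues is a hypothetical table for a connected GitHub source:

# Sketch: pull open issues, then have the LLM bucket them by priority.
sql = """
SELECT id, title, body
FROM app.Issues
WHERE state = 'open'
"""
issues = executor.run_sql(sql)

# Hypothetical LLM classification step; consult the PromptQL docs
# for the actual primitive and its signature.
priorities = executor.classify(
    instructions="Assign a priority to each GitHub issue.",
    inputs=[f"{i['title']}\n{i['body']}" for i in issues],
    categories=["high", "medium", "low"],
)

triaged = [
    {"id": i["id"], "title": i["title"], "priority": p}
    for i, p in zip(issues, priorities)
]
executor.store_artifact("issue_triage", "Open Issues by Priority", "table", triaged)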
Cool, but what can it do? (Real-world applications beyond plastic bricks)
I've actually played with hooking this up to GitHub issues, and some other data sources. While it has its rough edges, I've been able to accomplish all kinds of interesting things. Like asking PromptQL to identify areas to focus on for improving the project using the actual content from the GitHub issues. Or wiring it up to a support ticket system and billing data, and asking PromptQL to identify the customers that are at risk of churning and help me write customized emails to each of them.
I will mention, sometimes it gets specific SQL features or syntax support wrong for your data source. Other times it may time out on particularly huge or long-running queries. However, a very neat feature is that PromptQL will catch these issues and, with a feedback loop, attempt to self-correct, which is pretty cool to witness. Other times, gentle steering from you can work around such issues.
Being able to understand what the agent is doing and why it behaves as it does is incredibly useful. Just as helpful is having what amounts to a second set of eyes as you ask questions and explore, not to mention something close to expert-level SQL knowledge relevant to your schema.
Just like LEGO bricks, I've found PromptQL gives me the building blocks to construct meaning from data. Whether you're analyzing plastic bricks or production metrics, having a tool that shows its work and adapts to your schema is invaluable. I'd encourage anyone interested in LLMs, RAG, or Agentic flows to give it a try. And hey, even if you're just someone wanting to explore a dataset or thinking about giving your users a more natural way to ask questions about their data, you might find yourself building something pretty useful with PromptQL!