With the explosive growth of cloud data warehouses, an entire ecosystem of new data tools has emerged: transformation, testing, quality, observability, to name a few. On top of this modern data stack, we’ve seen a proliferation of new applications, including dashboards, embedded analytics, automation tools, and vertical-specific reporting tools. This boom in the modern data stack necessitates a new generation of business intelligence: headless BI.
Why? Because every addition to the ecosystem compounds the challenge of defining metrics consistently and making them accessible to every application. This challenge has only grown more urgent, as others have observed.
Specifically, the challenge of data consistency poses a problem for users of traditional business intelligence tools, like Tableau, Looker, or Mode. These tools take metric definition upon themselves and don’t share those definitions with other applications:
While these tools enable business users to use their data better, they have one fatal constraint: users can only use the metrics they’ve defined within the four walls of the visualization tool.[1]
In response, we’ve begun to see so-called “headless” BI. These are platforms that decouple data metrics from the presentation layers that display them—pushing metrics definition up the data stack.
However, the popularity of the phrase hasn’t been accompanied by a consistent definition of what headless BI even is. There is more to headless BI than just upstack metrics definition.
For the benefit of data engineers and data consumers alike, we propose a standardized definition of headless business intelligence. And with it, we describe four essential components of a headless BI tool.
1. Data Modeling
This is the most widely understood aspect of headless BI. Simply, software engineering best practices dictate that metrics and their governance should a) live in code, and b) be deduplicated. Deduplication refers to removing them from each end-user application in which they’re variously and inconsistently defined, and moving them upstack.
Fundamentally, defining metrics within a data application leads to duplicated effort and duplicated results. These redundancies—and, therefore, inconsistencies—occur because, regardless of its size, no organization only uses one data application.
Sales teams use a CRM, marketing teams use marketing automation, and executives use dashboards. If each team were to independently count their own definition of “customer” or sum their own interpretation of “annual revenue”, at best these metrics would be defined more times than necessary. What’s more likely is that the definitions wouldn’t match—which is, in every case we can think of, not ideal. This outcome also defeats the whole point of collecting data and distributing it to every team.
When a headless BI tool defines metrics used by every downstream application, companies get uniform insights instead of inconsistent signals. What’s more, headless BI can help organize and simplify once-complex SQL queries by abstracting them away: a metrics layer can generate SQL queries for defined metrics and dimensions, so downstream applications—and their users—don’t have to. This layer also plays an important role in data governance, including lifecycle management, ownership, and change approval.
Lastly, in addition to its own definitions, a headless BI platform should consume those from further upstream—e.g., those from transformation tools like dbt.
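To make the idea of metrics-as-code concrete, here is a minimal sketch of a metric defined once and compiled into SQL for any downstream application. All names here (`Metric`, `annual_revenue`, the `orders` table) are hypothetical illustrations, not any vendor’s actual API.

```python
from dataclasses import dataclass

# Hypothetical sketch: a metric defined once, in code, from which a
# metrics layer generates SQL so downstream applications don't have to.

@dataclass
class Metric:
    name: str   # canonical metric name used by every downstream tool
    sql: str    # aggregation expression, defined exactly once
    table: str  # source table

    def to_query(self, dimensions=None):
        dims = dimensions or []
        select = ", ".join(dims + [f"{self.sql} AS {self.name}"])
        group_by = f" GROUP BY {', '.join(dims)}" if dims else ""
        return f"SELECT {select} FROM {self.table}{group_by}"

revenue = Metric(name="annual_revenue", sql="SUM(amount)", table="orders")
print(revenue.to_query(dimensions=["region"]))
# SELECT region, SUM(amount) AS annual_revenue FROM orders GROUP BY region
```

Because every tool asks the metrics layer for `annual_revenue` instead of writing its own `SUM`, there is exactly one definition to govern, review, and change.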
2. Access Control
Since a headless BI tool is the first layer downstream of one’s data sources, it must also manage and audit access to that data. These controls can’t be decoupled from the data model, and, therefore, should be defined within it.
Why?
First, this need also stems from a goal of deduplication. As matters of both good architecture and security best practices, the rules of who can access what should be defined and enforced once, not redundantly—and possibly inconsistently—by downstream tools.
Second, instrumenting access control within the data model enables dynamic metric definitions, that is, definitions that vary based on the context in which they’re requested. Contextually informed metrics are particularly useful in a multi-tenant model in which, for example, each of your customers ought to access a defined measure, “sales”, that reflects only their revenue. Each data consumer gains access only to the subset of metrics to which they’re entitled, without anyone manually redefining “sales” for each tenant.
Implementing access control within a headless BI tool makes it possible to expose aggregate metrics widely while still protecting specific sensitive data (e.g., personally identifiable information). With this precise control, you can limit access to an authorized subset of individuals or applications.
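The multi-tenant case above can be sketched in a few lines: the same “sales” definition serves every tenant, and a security context injects a row-level filter before any SQL is generated. The function name, context shape, and `orders` schema are assumptions for illustration.

```python
# Hypothetical sketch: access control enforced inside the data model.
# One metric definition; the security context scopes it per tenant.

def sales_query(security_context):
    base = "SELECT SUM(amount) AS sales FROM orders"
    tenant = security_context.get("tenant_id")
    if tenant is None:
        # No context, no data: the rule is enforced once, here,
        # rather than redundantly in every downstream tool.
        raise PermissionError("no tenant in security context")
    return f"{base} WHERE tenant_id = {int(tenant)}"

print(sales_query({"tenant_id": 42}))
# SELECT SUM(amount) AS sales FROM orders WHERE tenant_id = 42
```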
The above two layers constitute what could be considered a “metrics store” or a “modeling layer.” But we maintain that they are not sufficient to make up a headless BI solution.
To constitute headless BI, you also need to make data readily accessible to data consumers, and for that you need…
3. Caching
A headless BI tool is the correct place to situate a data caching layer. At first, it may seem counterintuitive to locate a caching layer upstack of data applications. However, the guiding principles here are, once again, consistency and deduplication.
For starters, a centrally located caching layer ensures consistent data freshness across tools, using one centrally defined schedule of cache invalidation and warming. Every tool will present the same values at the same time.
Additionally, if a headless BI tool has been tasked with accessing data from a data store and organizing it into definitions, then it is optimally positioned to manage queues of requests from multiple applications and data consumers.
Without this mediation, multiple applications may make the same requests. Redundant requests both incur needless bandwidth costs and degrade the performance of other concurrent queries.
Headless BI can provide several levels of caching. First, a query-level cache can store the results of the query to ensure that identical queries from a tool, or multiple tools, do not increase the load on the underlying data warehouse.
Additionally, a second layer can implement aggregate awareness: logic to find the smallest, most efficient table to serve the query. Aggregate tables can be either created externally or within the headless BI tool, and can significantly speed up queries and solve the cold cache problem when maintained in the background.
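Aggregate awareness amounts to a routing decision: among the tables that can answer the query, pick the smallest. A minimal sketch, with made-up table names and row counts:

```python
# Hypothetical sketch of aggregate awareness: route each query to the
# smallest pre-aggregated table whose dimensions cover the request,
# falling back to the raw table otherwise.

TABLES = [
    {"name": "orders_by_day",        "dims": {"day"},                       "rows": 1_000},
    {"name": "orders_by_day_region", "dims": {"day", "region"},             "rows": 50_000},
    {"name": "orders_raw",           "dims": {"day", "region", "customer"}, "rows": 10_000_000},
]

def pick_table(requested_dims):
    # A table qualifies if its dimensions are a superset of the request.
    candidates = [t for t in TABLES if set(requested_dims) <= t["dims"]]
    return min(candidates, key=lambda t: t["rows"])["name"]

print(pick_table(["day"]))
# orders_by_day
print(pick_table(["day", "region"]))
# orders_by_day_region
```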
The prior era of multi-second dashboard loads is over, and it’s not coming back. By managing caching, a headless BI tool can deliver sub-second responses in every downstream application, regardless of the number of requests or the volume of data in the underlying data store.
4. APIs
Finally, we get to the essential nature of headless BI’s “headlessness.” A headless BI tool must make its data accessible to every “head” application, be it data visualization, dashboarding, embedded analytics, or automation. This means that a headless BI tool must make its data available via various APIs.
The most obvious candidate here is a SQL endpoint. This enables data consumers to keep using dashboard tools, notebooks, and legacy applications, such as Tableau, that they formerly would have connected directly to a data warehouse.
The next additions are REST and GraphQL. These ensure that the same metrics that reach dashboards are also available to embedded analytics features, end-user-facing applications, and other innovative uses of your data.
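The key property of a REST-style API is that a “head” application sends a query naming measures and dimensions, and the headless BI layer resolves them against its central definitions. A minimal sketch of that translation, with an assumed payload shape and metric catalog (not any vendor’s actual API):

```python
import json

# Hypothetical sketch: a REST-style request body is mapped onto the
# centrally defined metrics, so every "head" gets the same SQL.

METRICS = {"sales": "SUM(amount)"}  # illustrative central metric catalog

def handle_request(body):
    q = json.loads(body)
    measure = METRICS[q["measure"]]          # resolve against one definition
    dims = q.get("dimensions", [])
    select = ", ".join(dims + [f"{measure} AS {q['measure']}"])
    group_by = f" GROUP BY {', '.join(dims)}" if dims else ""
    return f"SELECT {select} FROM orders{group_by}"

print(handle_request('{"measure": "sales", "dimensions": ["region"]}'))
# SELECT region, SUM(amount) AS sales FROM orders GROUP BY region
```

A dashboard, an embedded chart, and an automation workflow can all post the same payload and receive the same numbers.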
By exposing APIs to connect discrete applications, a headless BI tool makes it possible to build new automations. Some examples include a workflow that alerts a sales team to contact a customer based on changes in that customer’s account usage, or serving as an upstream data source for applications such as reverse ETL.
“Business intelligence” no longer solely consists of reacting to data in a dashboard; now, an intelligent business actually has actionable data—and acts on it.
Headless BI should be open source
We’ve just laid out the essential components of a headless BI system. In theory, this could all be furnished by a vendor in a vertically integrated, closed system. However, a headless BI system is truly valuable when it integrates into any data stack, made up of pieces from any number of vendors.
As a practical matter, this is only possible when multiple companies collaborate: each vendor can contribute connectors and optimizations to improve the compatibility of a headless BI tool with their tools.
Furthermore, headless BI is so important, and its benefits to the community are so great, that this technology must be made available to everyone. Its stewardship should be guided by community input and community values.
We’re proud of the community that has grown around Cube in the three years since we open-sourced our headless BI platform. Over 200 contributors have enriched our tool on our GitHub repo, and over 5000 users have joined our community Slack to trade tips and share feedback.
Headless BI: WIP
What next? There are many optimizations to add, tools with which to integrate, and powerful features to build. Along the way, we’re soliciting and building connectors to tools like legacy BI applications and gathering feedback in our Slack. Our Twitter DMs are open and our community meets every month. And, with this momentum, we’re hiring more bright minds to help build Cube full-time.
Together, we’ll bring data to life and power the next generation of consistent and powerful data applications.
Won’t you join us?