matthieucham for Onepoint x Stack Labs

Posted on Jul 30, 2024

Automatically Update BigQuery View Schema Changes

#bigquery #eventdriven #googlecloud #dataengineering

SQL views are virtual tables simplifying data access and security. They offer tailored data perspectives, protecting sensitive information. Data analysts widely use them to streamline modeling.

As such, views are a crucial feature of Google Cloud's fully managed data warehouse, BigQuery. However, they have certain limitations. One of these limitations can be particularly troublesome for data analysts and end-users:

The schemas of the underlying tables are stored with the view when the view is created. If columns are added, deleted, or modified after the view is created, the view isn't automatically updated and the reported schema will remain inaccurate until the view SQL definition is changed or the view is recreated. Even though the reported schema may be inaccurate, all submitted queries produce accurate results.

To see this limitation into action, create a source table with two columns

CREATE OR REPLACE TABLE `demo_devto.source_table` (
  A STRING,
  B STRING
) AS (
  SELECT "a", "b" 
)

Then create a view above it

CREATE OR REPLACE VIEW `demo_devto.expo_view` AS (
  SELECT * from `demo_devto.source_table`
)

As expected, the schema of the view presents 2 columns A and B

Now add a column to the source table

ALTER TABLE `demo_devto.source_table`
  ADD COLUMN C STRING

The new column is reflected by the source table's schema

But not by the view's schema

Still, the result of a query is correct with 3 columns

This article outlines a method to circumvent this limitation and maintain the view's schema in alignment with the underlying table's schema as closely as possible.

A fully serverless event-driven architecture to synchronize schemas

This solution make use of a log sink to capture audit logs from BigQuery, a PubSub topic where relevant log entries are directed, a PubSub subscription and a Cloud Run service to process them

Let's review each step and dive into details

1. Bigquery audit logs

All Google Cloud services generate logs which are viewable in Cloud Logging. BigQuery is no exception and audit logs offer all information we need. See their structure here

2. Cloud Logging log sink

A log sink is a location where the logs are collected and stored. Google Cloud Logging log sinks collect within a scope - project, folder, organization. So to capture update logs from tables for a whole organization, a log sink at organization level is needed. To monitor a project only, a sink at project level is enough.

A log sink must declare a filter. This is very important to limit costs - which depend of the volume of captured logs - and to process relevant events only. Here we are using the following filter to capture events about schema changes:

resource.type="bigquery_resource"
AND protoPayload.serviceName="bigquery.googleapis.com"
AND protoPayload.methodName="tableservice.update"
AND protoPayload.authenticationInfo.principalEmail !~ <regex identifying the service account used by the cloud run service who process logs>

The filter on principalEmail serves as a mechanism to identify updates to exposition views made by the Cloud Run service, which we wish to exclude from processing as our focus lies solely on source table update events.

Finally, we need to give the sink a destination, where received logs who pass the filter are routed. Several kinds of destination are possible. Because our architecture is event-driven, the selected destination is a PubSub topic. The log entry is then encoded as JSON

Here is how to provision such a sink with Terraform, at project level:

resource "google_logging_project_sink" "demo" {
  provider               = google-beta
  project                = "my-project"
  name                   = "logsink-demo"
  destination            = "pubsub.googleapis.com/${google_pubsub_topic.demo.id}"
  filter                 = <<EOT
    resource.type="bigquery_resource"
    AND protoPayload.serviceName="bigquery.googleapis.com"
    AND protoPayload.methodName="tableservice.update"
    AND protoPayload.authenticationInfo.principalEmail !~ "^sa-demo@myproject.iam.gserviceaccount.com$"
  EOT
  unique_writer_identity = true
}

resource "google_pubsub_topic_iam_member" "demo" {
  provider = google-beta
  topic    = google_pubsub_topic.demo.id
  role     = "roles/pubsub.publisher"
  member   = google_logging_project_sink.demo.writer_identity
}

3. PubSub topic and subscription

The PubSub topic is the destination of log events who pass the log sink filter.

To consume these events, a subscription in PUSH mode send these events to a HTTPS endpoint.

Here is an example of how these resources can be provisioned with Terraform:

resource "google_pubsub_topic" "demo" {
  provider = google-beta
  name     = "topic-demo"
}

resource "google_pubsub_subscription" "demo" {
  provider             = google-beta
  name                 = "sub-demo"
  topic                = google_pubsub_topic.demo.id
  ack_deadline_seconds = 600

  push_config {
    push_endpoint = <URL of the cloud run endpoint>
    oidc_token {
      service_account_email = google_service_account.default.email
    }
  }

  expiration_policy {
    ttl = ""
  }
}

4. 5. and 6. Events processing

The processing of log events is performed by a Cloud Run service in this system, but could be done by a Cloud Function for example.

In Python, the decoding of incoming events can be done like this:

import base64
import json

bq_log = json.loads(base64.b64decode(message["data"]).decode("utf-8"))

By parsing the bq_log object, we can retrieve the updated table id:

import re

from google.cloud.bigquery import TableReference

RESOURCENAME_PATTERN = re.compile(
    "^projects/(?P<project>[^/]+)/datasets/(?P<dataset>[^/]+)/tables/(?P<table>[^/]+)$"
)

resource_name = bq_log.get("protoPayload", {}).get("resourceName", "")
if (match := RESOURCENAME_PATTERN.match(resource_name)) is not None:
    return TableReference.from_api_repr(
        {k + "Id": match.group(k) for k in ["project", "dataset", "table"]}
    )

The next step is to identify the views which relies on this source table. Here, associations between source tables and exposition views are registered in a Firestore database, but other designs are possible. For example, you could query INFORMATION_SCHEMA.VIEWS metadata views and identify the affected views by parsing the content of the VIEW_DEFINITION column

SELECT VIEW_DEFINITION FROM `demo_devto.INFORMATION_SCHEMA.VIEWS`

Finally, synchronize all affected views. BigQuery views seem to not support the updating of the "schema" field by the update_table() method when columns are added. The recommended way is then to re-create views with SQL DDL statements:

CREATE OR REPLACE VIEW AS ...

With all steps pieced together, any schema update from source tables automatically triggers the re-creation of exposition views, keeping the schema synchronized after a short delay !

Thanks for reading! I’m Matthieu, data engineer at Stack Labs.
If you want to discover the Stack Labs Data Platform or join an enthousiast Data Engineering team, please contact us.

Cover picture by Miguel Delmar on Unsplash