iamtodor for FreshBooks

Posted on May 26, 2023 • Edited on Jul 30, 2023

Manipulating Complex Structures in BigQuery: A Guide to DDL Operations

#bigquery

This guide aims to provide a comprehensive understanding of handling changes in complex structures within BigQuery using Data Definition Language (DDL) statements. It explores scenarios involving top-level columns as well as nested columns, addressing limitations with the existing on_schema_change configuration in dbt for BigQuery.

Introduction
How-to
Bonus notes
- Create a regular table
- Create an external table
Contact info

Introduction

Currently, dbt's on_schema_change configuration only tracks schema changes related to top-level columns in BigQuery. Nested column changes, such as adding, removing, or modifying a STRUCT, are not captured. This guide delves into extending the functionality of on_schema_change to encompass nested columns, enabling a more comprehensive schema change tracking mechanism. What exactly are top-level as well as nested ones I'm going to show further.

Moreover, it's important to note that BigQuery explicitly states on their Add a nested column to a RECORD column page that adding a new nested field to an existing RECORD column using a SQL DDL statement is not supported:

Adding a new nested field to an existing RECORD column by using a SQL DDL statement is not supported.

When it comes to drops one or more columns, from ALTER TABLE DROP COLUMN statement
page:

You cannot use this statement to drop the following:

Partitioned columns

Clustered columns

Nested columns inside existing RECORD fields

Columns in a table that has row access policies

This is the ongoing proposal with the discussion around
on_schema_change should handle non-top-level schema changes topic.

This limitation further highlights the need for alternative approaches to manipulate complex structures.

How-to

Define schema

To begin, let's dive into the SQL syntax and create the "person" table. This table will store information about individuals, including their ID, name, and address.



CREATE TABLE IF NOT EXISTS dataset_name.person (
    id INT64,
    name STRING,
    address STRUCT <
        country STRING,
        city STRING 
    >
)

Add records

Add a couple of records to the table.



INSERT INTO
    dataset_name.person (id, name, address)
VALUES
    (1, "John", STRUCT("USA", "New-York")),
    (2, "Jennifer", STRUCT("Canada", "Toronto"))

How schema in UI looks like

How data is represented while querying it with



SELECT
    *
FROM
    dataset_name.person

Add top-level field

Imagine we were tasked to add a new field has_car, that has an INT64 type.



ALTER TABLE
    dataset_name.person
ADD
    COLUMN IF NOT EXISTS has_car INT64;

-- add record right away
INSERT INTO
    dataset_name.person (id, name, has_car, address)
VALUES
    (3, "James", 0, STRUCT("USA", "New-York"))

When you add a new column to an existing BigQuery table, the past records will have null values for that newly added column. This behavior is expected because the new column was not present at the time those records were inserted.

Change top-level field type

Then your customer changes their mind and now the has_car column has to have a BOOL type instead of INT64. Here are 2 possible ways to tackle this task.

Before diving deep into the possible approaches, worth to mention, that BigQuery has Conversion rules, that you need to consider. For
instance, you can cast BOOL to INT64, but you cannot cast INT64 to DATETIME.

In BigQuery, CAST and ALTER COLUMN are two different approaches for modifying the data type of a column in a table.
Let's explore each approach:

`CAST`

The CAST() function is used to convert the data type of a column or an expression in a SQL query. It allows you to convert a column from one data type to another during the query execution. However, it does not permanently modify the data type of the column in the table's schema.

The following is an example of using CAST to convert a column's data type in a query:



SELECT
    id,
    name,
    address,
    CAST(has_car as BOOL) as has_car
FROM
    dataset_name.person

`ALTER COLUMN`

The ALTER COLUMN statement is used to modify the data type of a column in the table's schema. It allows you to permanently change the data type of a column in the table, affecting all existing and future data in that column.

Here's an example of using ALTER COLUMN to modify the data type of a column:



ALTER TABLE
    dataset_name.person
ALTER COLUMN
    has_car
SET
    DATA TYPE BOOL;

It's important to note that ALTER COLUMN is a DDL statement and can only be executed as a separate operation outside of a regular SQL query. Once the column's data type is altered, it will affect all future operations and queries performed on that table.

In summary, CAST is used to convert the data type of a column during query execution, while ALTER COLUMN is used to permanently modify the data type of a column in the table's schema. The choice between the two depends on whether you want to temporarily convert the data type for a specific query or permanently change the data type for the column in the table.

Juggle with STRUCT

If we want to apply changes to nested fields, such as adding, removing, or modifying STRUCT itself there are few different ways to do so.

temp table

First, quite simple is using the temp table.



CREATE TABLE IF NOT EXISTS dataset_name.person_tmp (
    id INT64,
    name STRING,
    has_car INT64,
    address STRUCT <
        country STRING,
        city STRING,
        zip_code INT64
    >
);

-- fill then new zip_code field with the default 0 value
INSERT INTO
    dataset_name.person_tmp
SELECT
    id,
    name,
    has_car,
    (
        SELECT
            as STRUCT address.country,
            address.city,
            0 as zip_code
    ) as address
FROM
    dataset_name.person;

ALTER TABLE
    IF EXISTS `dataset_name.person` RENAME TO `person_past`;

ALTER TABLE
    IF EXISTS `dataset_name.person_tmp` RENAME TO `person`;

DROP TABLE dataset_name.person_tmp;

However, this approach has some drawbacks and considerations to keep in mind: when modifying a BigQuery table using a temporary table, you need to create a new table with the desired modifications and then copy the data from the original table to the temporary table.

Costs

As this process involves duplicating the data. It will increase storage usage, leading to additional storage costs as well as it consumes additional query processing resources.

Performance

It may impact performance, especially for large tables as you have a limited amount of production resources that are shared.

Complexity and consistency

Using a temporary table to modify a BigQuery table introduces additional steps and complexity to the process. You need to write queries to create the temporary table, copy data, modify the data, overwrite the original table, and then drop the temporary table. This adds complexity to the overall workflow and may require more code and query execution time.

Last, but not least, during the modification process, there might be a period where the original table is not accessible or is in an inconsistent state. If other processes or applications depend on the original table's data, this downtime or inconsistency could impact their operations.

So this is not the very best way.

update STRUCT using SET

Another scenario is to change the nested field type. Imagine we would like to update the zip_code type from STRING to INT64. Now we don't want to use the tmp table way. So the second way is to UPDATE STRUCT using



ALTER TABLE
    dataset_name.person
ADD
    COLUMN IF NOT EXISTS address_new STRUCT < country STRING,
    city STRING,
    zip_code STRING >;

UPDATE
    `dataset_name.person`
SET
    address_new = (
        SELECT
            AS STRUCT address.country,
            address.city,
            CAST(address.zip_code as STRING)
    )
WHERE
    TRUE;

ALTER TABLE
    dataset_name.person RENAME COLUMN address TO address_past;

ALTER TABLE
    dataset_name.person RENAME COLUMN address_new TO address;

ALTER TABLE
    dataset_name.person DROP COLUMN address_past;

In this case, only the STRUCT field will be duplicated. That is good enough.

CREATE OR REPLACE TABLE

Another last approach is using CREATE OR REPLACE TABLE.



CREATE
OR REPLACE TABLE dataset_name.person AS
SELECT
    id,
    name,
    has_car,
    (
        SELECT
            AS STRUCT address.country,
            address.city,
            CAST(address.zip_code as STRING)
    ) as address
FROM
    dataset_name.person

In the same way, we can remove nested fields. We can just select the needed fields and omit the ones we don't interested in.



SELECT
    address.*
FROM
    `dataset_name.person`;

SELECT
    * REPLACE (
        (
            SELECT
                AS STRUCT address.*
            EXCEPT
                (zip_code)
        ) AS address
    )
FROM
    `dataset_name.person`

Bonus notes

If you have some table schema from a separate dataset, that you need to create in your particular dataset the easiest the way is using CLI commands as it's a much faster and less error-prone way to create tables.

Create a regular table

This is the example of how to save table schema using Table ID to JSON format with bq show command



bq show \
    --schema \
    --format=prettyjson \
    project_name:dataset_name.table_name > table_name.json

And now you can create a table in your dataset using bq mk command:



bq mk \
    --table \
    your_dataset_name.table_name \
    table_name.json

Create an external table

Here is the example of creating a table definition in JSON format using bq mkdef:



bq mkdef \
    --source_format=NEWLINE_DELIMITED_JSON \
    --autodetect=false \
    'gs://bucket_name/prefix/*.json' > table_def

The mkdef command is to create a table definition in JSON format for data stored in Cloud Storage or Google Drive. It will be used to create an external table.



bq mk \

    --table \

    --external_table_definition=nicereply_csat_raw_def \

    dataset_name.table_name \

    table_name.json

Contact info

If you found this article helpful, I invite you to connect with me on LinkedIn. I am always looking to expand my network and connect with like-minded individuals in the data industry. Additionally, you can also reach out to me for any questions or feedback on the article. I'd be more than happy to engage in a conversation and help out in any way I can. So don’t hesitate to contact me, and let’s connect and learn together.

Top comments (1)

Peggy Ma • May 30 '23

Awesome! Thanks for sharing!

DEV Community