This guide aims to provide a comprehensive understanding of handling changes in complex structures within BigQuery using Data Definition Language (DDL) statements. It explores scenarios involving top-level columns as well as nested columns, addressing limitations with the existing on_schema_change
configuration in dbt for BigQuery.
Introduction
Currently, dbt's on_schema_change
configuration only tracks schema changes related to top-level columns in BigQuery. Nested column changes, such as adding, removing, or modifying a STRUCT
, are not captured. This guide delves into extending the functionality of on_schema_change
to encompass nested columns, enabling a more comprehensive schema change tracking mechanism. What exactly are top-level
as well as nested
ones I'm going to show further.
Moreover, it's important to note that BigQuery explicitly states on their Add a nested column to a RECORD column page that adding a new nested field to an existing RECORD column using a SQL DDL statement is not supported:
Adding a new nested field to an existing RECORD column by using a SQL DDL statement is not supported.
When it comes to drops one or more columns, from ALTER TABLE DROP COLUMN statement
page:
You cannot use this statement to drop the following:
- Partitioned columns
- Clustered columns
- Nested columns inside existing RECORD fields
- Columns in a table that has row access policies
This is the ongoing proposal with the discussion around
on_schema_change should handle non-top-level schema changes topic.
This limitation further highlights the need for alternative approaches to manipulate complex structures.
How-to
Define schema
To begin, let's dive into the SQL syntax and create the "person" table. This table will store information about individuals, including their ID, name, and address.
CREATE TABLE IF NOT EXISTS dataset_name.person (
id INT64,
name STRING,
address STRUCT <
country STRING,
city STRING
>
)
Add records
Add a couple of records to the table.
INSERT INTO
dataset_name.person (id, name, address)
VALUES
(1, "John", STRUCT("USA", "New-York")),
(2, "Jennifer", STRUCT("Canada", "Toronto"))
How schema in UI looks like
How data is represented while querying it with
SELECT
*
FROM
dataset_name.person
Add top-level field
Imagine we were tasked to add a new field has_car
, that has an INT64
type.
ALTER TABLE
dataset_name.person
ADD
COLUMN IF NOT EXISTS has_car INT64;
-- add record right away
INSERT INTO
dataset_name.person (id, name, has_car, address)
VALUES
(3, "James", 0, STRUCT("USA", "New-York"))
When you add a new column to an existing BigQuery table, the past records will have null values for that newly added column. This behavior is expected because the new column was not present at the time those records were inserted.
Change top-level field type
Then your customer changes their mind and now the has_car
column has to have a BOOL
type instead of INT64
. Here are 2 possible ways to tackle this task.
Before diving deep into the possible approaches, worth to mention, that BigQuery has Conversion rules, that you need to consider. For
instance, you can cast BOOL
to INT64
, but you cannot cast INT64
to DATETIME
.
In BigQuery, CAST
and ALTER COLUMN
are two different approaches for modifying the data type of a column in a table.
Let's explore each approach:
CAST
The CAST()
function is used to convert the data type of a column or an expression in a SQL query. It allows you to convert a column from one data type to another during the query execution. However, it does not permanently modify the data type of the column in the table's schema.
The following is an example of using CAST
to convert a column's data type in a query:
SELECT
id,
name,
address,
CAST(has_car as BOOL) as has_car
FROM
dataset_name.person
ALTER COLUMN
The ALTER COLUMN
statement is used to modify the data type of a column in the table's schema. It allows you to permanently change the data type of a column in the table, affecting all existing and future data in that column.
Here's an example of using ALTER COLUMN
to modify the data type of a column:
ALTER TABLE
dataset_name.person
ALTER COLUMN
has_car
SET
DATA TYPE BOOL;
It's important to note that ALTER COLUMN
is a DDL statement and can only be executed as a separate operation outside of a regular SQL query. Once the column's data type is altered, it will affect all future operations and queries performed on that table.
In summary, CAST
is used to convert the data type of a column during query execution, while ALTER COLUMN
is used to permanently modify the data type of a column in the table's schema. The choice between the two depends on whether you want to temporarily convert the data type for a specific query or permanently change the data type for the column in the table.
Juggle with STRUCT
If we want to apply changes to nested fields, such as adding, removing, or modifying STRUCT
itself there are few different ways to do so.
temp table
First, quite simple is using the temp
table.
CREATE TABLE IF NOT EXISTS dataset_name.person_tmp (
id INT64,
name STRING,
has_car INT64,
address STRUCT <
country STRING,
city STRING,
zip_code INT64
>
);
-- fill then new zip_code field with the default 0 value
INSERT INTO
dataset_name.person_tmp
SELECT
id,
name,
has_car,
(
SELECT
as STRUCT address.country,
address.city,
0 as zip_code
) as address
FROM
dataset_name.person;
ALTER TABLE
IF EXISTS `dataset_name.person` RENAME TO `person_past`;
ALTER TABLE
IF EXISTS `dataset_name.person_tmp` RENAME TO `person`;
DROP TABLE dataset_name.person_tmp;
However, this approach has some drawbacks and considerations to keep in mind: when modifying a BigQuery table using a temporary table, you need to create a new table with the desired modifications and then copy the data from the original table to the temporary table.
Costs
As this process involves duplicating the data. It will increase storage usage, leading to additional storage costs as well as it consumes additional query processing resources.
Performance
It may impact performance, especially for large tables as you have a limited amount of production resources that are shared.
Complexity and consistency
Using a temporary table to modify a BigQuery table introduces additional steps and complexity to the process. You need to write queries to create the temporary table, copy data, modify the data, overwrite the original table, and then drop the temporary table. This adds complexity to the overall workflow and may require more code and query execution time.
Last, but not least, during the modification process, there might be a period where the original table is not accessible or is in an inconsistent state. If other processes or applications depend on the original table's data, this downtime or inconsistency could impact their operations.
So this is not the very best way.
update STRUCT using SET
Another scenario is to change the nested field type. Imagine we would like to update the zip_code
type from STRING
to INT64
. Now we don't want to use the tmp
table way. So the second way is to UPDATE STRUCT
using
ALTER TABLE
dataset_name.person
ADD
COLUMN IF NOT EXISTS address_new STRUCT < country STRING,
city STRING,
zip_code STRING >;
UPDATE
`dataset_name.person`
SET
address_new = (
SELECT
AS STRUCT address.country,
address.city,
CAST(address.zip_code as STRING)
)
WHERE
TRUE;
ALTER TABLE
dataset_name.person RENAME COLUMN address TO address_past;
ALTER TABLE
dataset_name.person RENAME COLUMN address_new TO address;
ALTER TABLE
dataset_name.person DROP COLUMN address_past;
In this case, only the STRUCT
field will be duplicated. That is good enough.
CREATE OR REPLACE TABLE
Another last approach is using CREATE OR REPLACE TABLE
.
CREATE
OR REPLACE TABLE dataset_name.person AS
SELECT
id,
name,
has_car,
(
SELECT
AS STRUCT address.country,
address.city,
CAST(address.zip_code as STRING)
) as address
FROM
dataset_name.person
In the same way, we can remove nested fields. We can just select the needed fields and omit the ones we don't interested in.
SELECT
address.*
FROM
`dataset_name.person`;
SELECT
* REPLACE (
(
SELECT
AS STRUCT address.*
EXCEPT
(zip_code)
) AS address
)
FROM
`dataset_name.person`
Bonus notes
If you have some table schema from a separate dataset, that you need to create in your particular dataset the easiest the way is using CLI commands as it's a much faster and less error-prone way to create tables.
Create a regular table
This is the example of how to save table schema using Table ID to JSON format with bq show command
bq show \
--schema \
--format=prettyjson \
project_name:dataset_name.table_name > table_name.json
And now you can create a table in your dataset using bq mk command:
bq mk \
--table \
your_dataset_name.table_name \
table_name.json
Create an external table
Here is the example of creating a table definition in JSON format using bq mkdef:
bq mkdef \
--source_format=NEWLINE_DELIMITED_JSON \
--autodetect=false \
'gs://bucket_name/prefix/*.json' > table_def
The mkdef
command is to create a table definition in JSON format for data stored in Cloud Storage or Google Drive. It will be used to create an external table.
bq mk \
--table \
--external_table_definition=nicereply_csat_raw_def \
dataset_name.table_name \
table_name.json
Contact info
If you found this article helpful, I invite you to connect with me on LinkedIn. I am always looking to expand my network and connect with like-minded individuals in the data industry. Additionally, you can also reach out to me for any questions or feedback on the article. I'd be more than happy to engage in a conversation and help out in any way I can. So don’t hesitate to contact me, and let’s connect and learn together.
Top comments (1)
Awesome! Thanks for sharing!