The Problem: Caching does not work with wildcard tables
From BigQuery official docs:
Let's say you have some tables named my_data_2023_*
, where the asterisk represents various months. You want to analyze data across all these tables. Since BigQuery doesn't know automatically when new tables were created, it will invalidate any available cache and run a fresh query, so cache won't be used.
Just for reference, it's not a good practice to use date sharded tables:
Recently I faced a scenario where tables where dynamically created based on a business domain field, the date example is only for illustration purposes, if you are using sharded tables, the better solution is to migrate it to BigQuery partitions instead.
The Solution: Union THEM ALL!
Enter the BigQuery Information Schema:
The BigQuery INFORMATION_SCHEMA
views are read-only, system-defined views that provide metadata information about your BigQuery objects.
We can use the tables
view to dynamically generate a list of all tables matching our pattern (e.g., my_data_2023_*
). Then, we leverage UNION to combine individual queries for each identified table.
Here's a sample using Python:
from google.cloud import bigquery
client = bigquery.Client()
# Specify the dataset and wildcard pattern
dataset_id = "your-project.your_dataset"
wildcard_pattern = "my_data_2023_"
# Query the INFORMATION_SCHEMA to get matching table names
query = f"""
SELECT table_name
FROM `{dataset_id}.INFORMATION_SCHEMA.TABLES`
WHERE table_name LIKE '{wildcard_pattern}%'
"""
rows = list(client.query(f"SELECT table_name FROM `{dataset_id}.INFORMATION_SCHEMA.TABLES` "
f"where table_name like '{your_table_prefix_}%'"))
if not rows:
return
view_query = __create_sql(dict(rows[0])["table_name"])
for row in table_names[1:]:
view_query = f"""
{view_query}
UNION ALL
{__create_sql(dict(row['table_name'])}
"""
I omitted the __create_sql
function, which is just a logic that creates a complex SQL based on each table name, with the generated SQL then you can use it to create a BigQuery view:
view = bigquery.Table(table_ref)
view.view_query = view_query
client.create_table(view, exists_ok=True)
Hope that helps, cheers!
Top comments (0)