BigQuery is taking the analytics industry by a storm. One of the most important use-cases is to extract and transform parts of a string. Below are some of the powerful constructs provided by BigQuery.
Regex Constructs
BigQuery provides the following constructs for regex operations
Construct | Arguments | Use |
---|---|---|
REGEXP_CONTAINS | REGEXP_CONTAINS(value, regexp) |
Checks if the string contains the regex |
REGEXP_EXTRACT | REGEXP_EXTRACT(value, regexp[, position[, occurrence]]) |
Extract the substring that matches the regex. |
REGEXP_EXTRACT_ALL | REGEXP_EXTRACT_ALL(value, regexp) |
Return all substrings that match the regex |
REGEXP_INSTR | REGEXP_INSTR(source_value, regexp [, position[, occurrence, [occurrence_position]]]) |
Return 1-based index for the first occurrence of the regex |
REGEXP_REPLACE | REGEXP_REPLACE(value, regexp, replacement) |
Replace all substrings in value |
Source : Google Cloud Platform
Table Setup
Let's create a simple table with some string data to play around.
-- Create Store
CREATE OR REPLACE TABLE `gcp-tour-276710.thefellowcoder.store`
(
date DATE,
details STRING,
address STRING
)
-- Add values
INSERT INTO `gcp-tour-276710.thefellowcoder.store`
VALUES
('2021-07-14', '<h1>Nike Shoes</h1><p>Base: $450 Addon: $20</p><p>Support email: help@nike.com</p>','Los Angeles, California 90001 '),
('2021-06-13', '<h1>Rebook Shoes</h1><p>Base: $150 Addon: $10</p><p>Support email: help@rebook.com</p>','Sacramento, California 94203 '),
('2021-07-11', '<h1>Adidas Shoes</h1><p>Base: $350 Addon: $30</p><p>Support email: help@adidas.com</p>','Beverly Hills, California 90209 '),
('2021-07-11', '<h1>Carnival All Rounder Shoes</h1><p>Base: $250 Addon: $0</p>','Beverly Hills, California 90209 ')
Sample Queries
Lets see each of the constructs in action
- REGEXP_CONTAINS example
-- All stores that have a support email
select * from `gcp-tour-276710.thefellowcoder.store` where REGEXP_CONTAINS(details, r'\S+@\S+\.\S+')
- REGEXP_EXTRACT example
-- Extract Product Names
select REGEXP_EXTRACT(details, r'<h1>([^<]+)<\/h1>') from `gcp-tour-276710.thefellowcoder.store`
- REGEXP_EXTRACT_ALL example
-- Extract all prices
select REGEXP_EXTRACT_ALL(details, r'\$(.\d+)') from `gcp-tour-276710.thefellowcoder.store`
- REGEXP_INSTR example
-- Get index of email
select details, REGEXP_INSTR(details, r'\S+@\S+\.\S+') from `gcp-tour-276710.thefellowcoder.store`
- REGEXP_REPLACE example
-- Replace all email-ids
select REGEXP_REPLACE(details, r'\S+@\S+\.\S+', 'admin@thefellowcoder.com') from `gcp-tour-276710.thefellowcoder.store`
Common Use-cases:
Extract part of the string with regex
This is one of the most common use-cases wherein we must extract substring that matches a regex.
-- Extract product names from all details
select REGEXP_EXTRACT(details, r'<h1>([^<]+)<\/h1>') from `gcp-tour-276710.thefellowcoder.store`
Replace regex with string
Let's say you want to sanitize the string and remove all special characters from the string.
-- remove special characters
select details,REGEXP_REPLACE(details, r'[^0-9a-zA-Z]+',' ') from `gcp-tour-276710.thefellowcoder.store`
Matches any of the regex patterns
check for multiple regex patterns for a string
-- check for pincodes starting with 90/94
select address,REGEXP_EXTRACT(address, r'9[0|4]\d+') as pincode from `gcp-tour-276710.thefellowcoder.store`
Useful regex examples for data cleaning.
String | Regex | Output |
---|---|---|
<h1>Hello World!</h1> |
<h1>([^<]+)<\/h1> |
Hello World! |
Support email: help@rebook.com |
\w+@\w+.\w+ | help@rebook.com |
Rebook ShOes | (?i)shoes | ShOes |
This,contains$special\characters | [^0-9a-zA-Z]+ | This contains special characters |
String needs to be trimmed. | ^(\s+).*?(\s+)$ | String needs to be trimmed. |
Hope you found this post useful, do check out my blog for more such content.
Top comments (0)