Data Tokenization
Ever since I started working as a developer, data protection has been one of my top priorities. It is important to take extra care when dealing with data, but it is even more important to protect sensitive data systematically. If sensitive data simply cannot be accessed, we know the protection is already in place and our effort can go into more productive activities.
As data engineers, we inevitably need to analyse input data or verify the outputs of our algorithms on occasion. Even then, we still want to protect sensitive customer data, including Personally Identifiable Information (PII), but we sometimes need to keep the values distinguishable from one another. For example, I want to tell 'red', 'green' and 'blue' apart rather than collapsing all of them into a single value such as 'COLOR'.
# desirable input and output
# where some_value_{1|2|3} are not identical
red -> some_value_1
green -> some_value_2
blue -> some_value_3
One approach we can take is to tokenize or anonymize sensitive data so that the values are still available for meaningful data analysis. Tokenization is the process of removing actual sensitive data elements by replacing them with undecipherable tokens for data security purposes.
You may also decide to keep the original data to feed into applications that strictly involve no human interaction during processing. In this case, we need to ensure that the original data is hidden from, or well protected against, human users. We will explore both cases in this post.
Column-Level Security in Snowflake: Masking Policy
Snowflake supports column-level security, so we can apply a masking policy to any column of a table or view that needs to be hidden or needs more granular access. The policy's behaviour can then be customized per role.
Snowflake offers Dynamic Data Masking and External Tokenization for data tokenization.
"Dynamic Data Masking is a Column-level Security feature that uses masking policies to selectively mask data at query time that was previously loaded in plain-text into Snowflake." In other words, the underlying raw data does not change but the results of the query show masked/unmasked data depending on the role used at query time. Also, one of the benefits of Dynamic Data Masking is that we can easily apply the Snowflake built-in functions to a masking policy. Once a masking policy is applied to a column, the masking policy is working in anywhere the column is in use. Thus, the results of the query can vary depending on the role.
"External Tokenization allows organizations to tokenize sensitive data before loading that data into Snowflake and dynamically detokenize data at query runtime using masking policies with External Functions." This requires External Functions and it may need a tokenization provider. Opposite to Dynamic Data Masking, the default results of the queries are already tokenized to the column before the data is loaded in Snowflake. Thus, we need the masking policy to de-tokenize data at query time to retrieve the raw data.
In my case, the applications need the raw data for calculations, and an additional pipeline to implement External Tokenization would add unnecessary complexity to our product. Therefore, I will take the Dynamic Data Masking approach in this post.
Implementation
To look at a simple example, let's create a table called TEST with the columns 'ID', 'UUID', 'NAMES' and 'POSTCODE'.
create or replace table test(
id VARCHAR(36) default UUID_STRING(),
uuid VARCHAR(36) default UUID_STRING(),
names ARRAY,
postcode VARCHAR(20)
);
insert into test (id, uuid, names, postcode)
select UUID_STRING(), UUID_STRING(), PARSE_JSON($1), $2
from values
('["string 0", "string 1"]', '12345'),
('["name 1", "name 2", "name 3"]', 'A1B2C3'),
('["string 0", "string 1"]', 'XYZ123'),
('["string 0"]', 'postcode0'),
('["string 0", "string 1", "string 99"]', 'mypostcode');
In this example, the 'NAMES' field and the 'POSTCODE' field contain sensitive data, so the masking policies will be applied to those columns.
Roles
As mentioned earlier, we will see how to apply different functions for different roles. Let's start by creating a few roles.
The role for masking policies
We create one designated role, MASK_ADMIN, for masking policies; it holds the privileges to create and apply masking policies. This role can then be granted to another role when appropriate.
use role accountadmin;
create or replace role MASK_ADMIN;
grant usage on database <db_name> to role MASK_ADMIN;
grant usage on schema <db_name>.<schema_name> to role MASK_ADMIN;
grant all on table <db_name>.<schema_name>.TEST to role MASK_ADMIN;
grant usage on warehouse <warehouse_name> to role MASK_ADMIN;
grant create masking policy on schema <db_name>.<schema_name> to role MASK_ADMIN;
grant apply masking policy on account to role MASK_ADMIN;
grant role MASK_ADMIN to user <username>;
-- grant to another role when appropriate
grant role MASK_ADMIN to role <some_role>;
The roles of applications
The roles for applications should have access to the raw data for calculations. I will create a dummy role called MASK_FULL_ACCESS for simplicity.
use role accountadmin;
create or replace role MASK_FULL_ACCESS;
grant usage on database <db_name> to role MASK_FULL_ACCESS;
grant usage on schema <db_name>.<schema_name> to role MASK_FULL_ACCESS;
grant select on table <db_name>.<schema_name>.TEST to role MASK_FULL_ACCESS;
grant usage on warehouse <warehouse_name> to role MASK_FULL_ACCESS;
grant role MASK_FULL_ACCESS to user <username>;
The roles of human users
People must not be able to access any sensitive data in production. I will create two dummy roles called MASK_PARTIAL_ACCESS and SNOWFLAKE_DATAENG by repeating the same commands used for MASK_FULL_ACCESS, as shown below.
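For reference, here is a sketch of MASK_PARTIAL_ACCESS using the same pattern (SNOWFLAKE_DATAENG is identical apart from the role name):
use role accountadmin;
create or replace role MASK_PARTIAL_ACCESS;
grant usage on database <db_name> to role MASK_PARTIAL_ACCESS;
grant usage on schema <db_name>.<schema_name> to role MASK_PARTIAL_ACCESS;
grant select on table <db_name>.<schema_name>.TEST to role MASK_PARTIAL_ACCESS;
grant usage on warehouse <warehouse_name> to role MASK_PARTIAL_ACCESS;
grant role MASK_PARTIAL_ACCESS to user <username>;
-- repeat the same statements for SNOWFLAKE_DATAENG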
User-Defined Functions for different access groups
We could hardcode the specific roles in the masking policy. However, that approach makes it harder to add or remove roles from the masking policy later, because you cannot alter a masking policy while it is in use on any column: you must first unset it from every column it is set on before you can drop it.
To work around this drawback, I will create user-defined functions (UDFs) that define the role groups for the different data access levels. Always make sure that only certain roles can alter those UDFs, for safety!
CREATE OR REPLACE FUNCTION <schema_name>.has_mask_full_access(role VARCHAR)
RETURNS BOOLEAN
LANGUAGE SQL
AS
$$
role in ('MASK_FULL_ACCESS', 'ACCOUNTADMIN')
$$
;
CREATE OR REPLACE FUNCTION <schema_name>.has_mask_partial_access(role VARCHAR)
RETURNS BOOLEAN
LANGUAGE SQL
AS
$$
role in ('MASK_PARTIAL_ACCESS', 'SNOWFLAKE_DATAENG')
$$
;
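As a quick sanity check, the UDFs can be called directly. For example, when run as ACCOUNTADMIN, the following should return TRUE:
SELECT <schema_name>.has_mask_full_access(CURRENT_ROLE());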
The MASK_ADMIN role now needs usage privileges on those functions.
use role accountadmin;
grant usage on function <db_name>.<schema_name>.has_mask_full_access(VARCHAR) to role MASK_ADMIN;
grant usage on function <db_name>.<schema_name>.has_mask_partial_access(VARCHAR) to role MASK_ADMIN;
Approach 1: Apply Dynamic Data Masking to columns
Here, I want to apply a hash function to sensitive data for roles with a partial data access level. Snowflake has the built-in SHA2 function (a cryptographic hash). SHA2 returns a hex-encoded string containing the N-bit SHA-2 message digest, where N is the specified output digest size. Note that SHA2 has no corresponding decryption function: the length of the output is independent of the length of the input, so the output does not necessarily have enough bits to hold all of the information from the input, and no function could decrypt all possible valid inputs.
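For example, with the default 256-bit digest size, SHA2 always returns a 64-character hex string, no matter how long the input is:
SELECT SHA2('A1B2C3');       -- 64-character hex string (SHA-256 by default)
SELECT SHA2('A1B2C3', 512);  -- 128-character hex string (SHA-512)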
Firstly, we create and apply a masking policy for the 'POSTCODE' field in the TEST table. The masking policy has to return the same data type as the input. So, we now create a masking policy for VARCHAR.
CREATE OR REPLACE MASKING POLICY <schema_name>.varchar_mask AS (val VARCHAR) RETURNS VARCHAR ->
CASE
WHEN <schema_name>.has_mask_full_access(current_role()) THEN val -- unmask
WHEN <schema_name>.has_mask_partial_access(current_role()) THEN SHA2(val) -- partial mask
ELSE '***MASKED***' -- mask
END;
ALTER TABLE IF EXISTS <schema_name>.test MODIFY COLUMN postcode SET MASKING POLICY <schema_name>.varchar_mask;
As defined in has_mask_full_access and has_mask_partial_access, only the MASK_FULL_ACCESS and ACCOUNTADMIN roles can retrieve the original postcode data. The MASK_PARTIAL_ACCESS and SNOWFLAKE_DATAENG roles retrieve the hashed values, while all other roles only see ***MASKED*** for every postcode.
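A quick way to verify this is to query the column under each role (a sketch, assuming the current user has been granted the roles created above):
USE ROLE MASK_FULL_ACCESS;
SELECT postcode FROM <db_name>.<schema_name>.test LIMIT 1;  -- e.g. '12345'
USE ROLE MASK_PARTIAL_ACCESS;
SELECT postcode FROM <db_name>.<schema_name>.test LIMIT 1;  -- a 64-character SHA2 digest
-- any other role with SELECT privilege would see '***MASKED***'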
Now, we create and apply a masking policy for the 'NAMES' field in the TEST table. Unfortunately, I couldn't find an optimal way of applying the SHA2 function to each element of an array, so I decided to join all the elements into one string and hash that with SHA2.
CREATE OR REPLACE FUNCTION <schema_name>.sha2_array(SRC ARRAY)
RETURNS ARRAY
LANGUAGE SQL
AS $$
ARRAY_CONSTRUCT(SHA2(ARRAY_TO_STRING(SRC, ',')))
$$
;
CREATE OR REPLACE MASKING POLICY <schema_name>.array_mask AS (val ARRAY) RETURNS ARRAY ->
CASE
WHEN <schema_name>.has_mask_full_access(current_role()) THEN val -- unmask
WHEN <schema_name>.has_mask_partial_access(current_role()) THEN <schema_name>.sha2_array(val::ARRAY)::ARRAY -- partial mask
ELSE TO_ARRAY('***MASKED***') -- mask
END;
ALTER TABLE IF EXISTS <schema_name>.test MODIFY COLUMN names SET MASKING POLICY <schema_name>.array_mask;
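Under the partial-access roles, each NAMES array now collapses into a single-element array holding one digest of the comma-joined elements, e.g.:
USE ROLE MASK_PARTIAL_ACCESS;
SELECT names FROM <db_name>.<schema_name>.test LIMIT 1;
-- e.g. ["<one 64-character digest of 'string 0,string 1'>"]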
These outputs would be good enough in many cases, but not in mine. My use case requires tokenizing each element of an array individually, and that computation would be costly if the masking policy performed it at every query. Thus, I chose Approach 2.
Approach 2: Create masked columns
Hashing each value is quite an expensive operation, so it can be more efficient to create additional masked columns and apply simple masking policies to them. For example, the 'NAMES' field in the TEST table can be customized to return ['***MASKED***'] for standard roles, the hashed names, e.g., ['hashed-name-1', 'hashed-name-2', ..., 'hashed-name-N'], for the data analysis role, or the raw NAMES for the application role. However, hashing each element independently is too expensive to run inside the masking policy at every query. As a result, we add extra columns that store the masked values in the final table.
Before we start, let's undo what we've applied by approach 1.
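-- policy_references lists the columns that currently use the given policy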
SELECT * FROM table(information_schema.policy_references(policy_name=>'<db_name>.<schema_name>.array_mask'));
ALTER TABLE IF EXISTS <db_name>.<schema_name>.test MODIFY COLUMN names UNSET MASKING POLICY;
DROP masking policy <db_name>.<schema_name>.array_mask;
SELECT * FROM table(information_schema.policy_references(policy_name=>'<db_name>.<schema_name>.varchar_mask'));
ALTER TABLE IF EXISTS <db_name>.<schema_name>.test MODIFY COLUMN postcode UNSET MASKING POLICY;
DROP masking policy <db_name>.<schema_name>.varchar_mask;
It is time to create additional masked columns and populate the masked values accordingly.
ALTER TABLE <schema_name>.test ADD COLUMN
masked_postcode VARCHAR(128) COMMENT 'Hashed POSTCODE by SHA2',
masked_names ARRAY COMMENT 'Hashed NAMES by SHA2';
-- populate the masked values
MERGE INTO <schema_name>.test
USING (
WITH
t AS (
SELECT
id,
SHA2(postcode) AS masked_postcode,
names
FROM <schema_name>.test
),
masked_names AS (
SELECT id AS masked_name_key, ARRAY_AGG(SHA2(VALUE::STRING)) WITHIN GROUP (ORDER BY INDEX) AS masked_names -- INDEX preserves the element order within each array
FROM t, TABLE(FLATTEN(names, OUTER => FALSE)) GROUP BY id
)
SELECT *
FROM t
LEFT JOIN masked_names n ON t.id = n.masked_name_key
) s ON <schema_name>.test.id = s.id
WHEN MATCHED THEN UPDATE SET
masked_postcode = s.masked_postcode,
masked_names = s.masked_names;
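As a sanity check, every masked array should have the same number of elements as its source. Run as a role that can see the raw data, the following should return no rows:
SELECT id
FROM <schema_name>.test
WHERE ARRAY_SIZE(names) != ARRAY_SIZE(masked_names);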
The masked values work perfectly for my use cases. However, we now have both the column containing the raw data and the column containing the masked data. First of all, we need to apply a masking policy to the raw data again. This time we will just use a simple one.
CREATE OR REPLACE MASKING POLICY <schema_name>.varchar_mask AS (val VARCHAR) RETURNS VARCHAR ->
CASE
WHEN <schema_name>.has_mask_full_access(current_role()) THEN val
ELSE '***MASKED***'
END;
ALTER TABLE IF EXISTS <schema_name>.test MODIFY COLUMN postcode SET MASKING POLICY <schema_name>.varchar_mask;
CREATE OR REPLACE MASKING POLICY <schema_name>.array_mask AS (val ARRAY) RETURNS ARRAY ->
CASE
WHEN <schema_name>.has_mask_full_access(current_role()) THEN val
ELSE ARRAY_CONSTRUCT('***MASKED***')
END;
ALTER TABLE IF EXISTS <schema_name>.test MODIFY COLUMN names SET MASKING POLICY <schema_name>.array_mask;
The original data is protected after applying the masking policies. Depending on your use case, this may or may not be enough. If you need extra data security, read the next section.
Approach 2+ (optional): Create separate view definitions
As mentioned in the section above, the final table now contains both the original columns and the masked columns. You might want just one appropriate column of each in the table. Worse, there can be catastrophic incidents if people are not careful when querying and updating the table, because the same query returns different results according to the role. For example, when a WHERE or JOIN clause is included in the query, the returned result might look odd to a role that can only see the masked data, e.g., '***MASKED***'. To prevent any unexpected updates, we create separate views for the different data access levels. (Also, I think the original table should be hidden in most cases.)
To protect the table from being altered by the various roles with different levels of data access, I will create a new schema called TOKENIZED_DATA and a new view definition called TEST.
CREATE SCHEMA IF NOT EXISTS tokenized_data;
CREATE OR REPLACE VIEW tokenized_data.test AS
SELECT id, uuid, masked_postcode, masked_names
FROM <schema_name>.test;
USE ROLE accountadmin;
GRANT USAGE ON SCHEMA <db_name>.tokenized_data TO ROLE MASK_PARTIAL_ACCESS;
GRANT SELECT ON VIEW <db_name>.tokenized_data.test TO ROLE MASK_PARTIAL_ACCESS;
The names of the masked columns start with the 'masked_' prefix. Consequently, when a query includes those columns, the column names should be adjusted accordingly. For example, the query SELECT id, uuid, postcode FROM test should now be written as SELECT id, uuid, masked_postcode FROM tokenized_data.test, as shown below.
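Concretely, for a partial-access role the second query below replaces the first:
SELECT id, uuid, postcode FROM <schema_name>.test;          -- postcode shows '***MASKED***' for this role
SELECT id, uuid, masked_postcode FROM tokenized_data.test;  -- the stored SHA2 digest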
I hope this post gave you what you were looking for. Thanks for reading!