The VARIANT data type is a recent introduction in Databricks (available in Databricks Runtime 15.3 and above) designed specifically for handling semi-structured data. It offers an efficient and flexible way to store and process this kind of data, which often has a dynamic or evolving schema.
Here's a quick rundown of its key features, as per the Databricks documentation:
Flexibility: VARIANT can store various data structures within a single column, including structs, arrays, maps, and scalars. This eliminates the need for pre-defined schemas, making it adaptable to changing data formats.
Performance: Compared to storing semi-structured data as JSON strings, VARIANT offers significant performance improvements. This is because VARIANT uses a binary encoding scheme for data representation, allowing for faster processing.
Ease of Use: Databricks recommends VARIANT as the preferred choice over JSON strings for working with semi-structured data. It provides familiar syntax for querying fields and elements within the VARIANT column using dot notation and array indexing.
Overall, the VARIANT data type streamlines working with semi-structured data in Databricks, enhancing flexibility, performance, and ease of use.
Snowflake has long offered the VARIANT data type, allowing you to store semi-structured data without pre-defining a schema. This eliminates the burden of schema design upfront.
In contrast, Delta Lake previously relied on the MAP data type, which requires a defined schema. However, semi-structured data often exhibits schema variations across rows, which creates challenges for data engineers: parsing the data correctly before storage was a necessary but tedious step.
In this exploration, I'll try to uncover the VARIANT data type in Databricks and its underlying mechanisms.
Some important Databricks documentation links:
- https://docs.databricks.com/en/delta/variant.html
- https://docs.databricks.com/en/semi-structured/variant.html
- https://docs.databricks.com/en/semi-structured/variant-json-diff.html
Code Reference
Steps to Set Up
Step 1: Provision a Databricks cluster with runtime 15.3
I created a test cluster with the 15.3 Beta runtime (Apache Spark 3.5.0, Scala 2.12).
Step 2: Verify that your notebook is running on runtime 15.3
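A quick way to check this from a notebook cell is sketched below; the cluster-tag config key is what I'd expect on Databricks, but treat it as an assumption and fall back to the cluster UI if it isn't set.

```python
# Spark version bundled with the runtime (DBR 15.3 ships Apache Spark 3.5.0)
print(spark.version)

# Databricks Runtime version string, e.g. "15.3.x-scala2.12"
# (assumed config key; the cluster configuration page shows the same information)
print(spark.conf.get("spark.databricks.clusterUsageTags.sparkVersion"))
```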
Step 3: Create a schema
Since this is the first run, we create the schema (you can use your own). After creation, it should start showing up in the catalog.
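A minimal sketch, assuming the schema name myschema used throughout this post:

```python
# Create the schema once; subsequent runs are a no-op thanks to IF NOT EXISTS
spark.sql("CREATE SCHEMA IF NOT EXISTS myschema")

# The schema should now be listed in the catalog
spark.sql("SHOW SCHEMAS").show()
```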
Step 4: Verify that the parse_json function is present (from version 15.3 onwards, it should be available)
As per the documentation (https://docs.databricks.com/en/semi-structured/variant.html), you can use the parse_json function to parse a column's JSON data. It validates whether the incoming data is valid JSON, and if you create a table using a SELECT over it, the resulting column gets the VARIANT data type.
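A minimal sketch of parse_json on a literal JSON string (the payload is illustrative):

```python
# parse_json converts a JSON string into a VARIANT value; invalid JSON raises an error
df = spark.sql("""SELECT parse_json('{"fb": "abc", "version": 1}') AS parsed""")

# The parsed column should be reported with the VARIANT data type
df.printSchema()
```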
Step 5: Create a table
In this step, we create a table variant_data_exploration in the schema myschema by parsing a JSON object. As per the query, it creates three columns (a sketch of the statement follows this list):
- id: int
- name: string
- raw: variant
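The original notebook's exact JSON payload isn't reproduced here, so the CTAS below is a sketch with an illustrative payload; the fb field matches the query used later in this post.

```python
# CREATE OR REPLACE TABLE ... AS SELECT: parse_json() makes the raw column a VARIANT
spark.sql("""
  CREATE OR REPLACE TABLE myschema.variant_data_exploration AS
  SELECT
    1 AS id,
    'abc' AS name,
    parse_json('{"fb": "abc", "source": "test"}') AS raw
""")
```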
Step 6: Table Schema Verification
As you can see below, under the # Delta Statistics Columns section of the table details (a sketch of how to reproduce this output follows the list):
Column Names: id, name, raw
Column Selection Method: first-32 (the default behavior of a Delta Lake table: it gathers statistics for the first 32 columns)
Location: dbfs:/user/hive/warehouse/myschema.db/variant_data_exploration
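These details come from the table's extended description; a sketch of reproducing them:

```python
# DESCRIBE TABLE EXTENDED includes the "# Delta Statistics Columns" section and the table location
spark.sql("DESCRIBE TABLE EXTENDED myschema.variant_data_exploration").show(100, truncate=False)
```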
Step 7: Verify Files in Table Location
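The listing below was produced with dbutils.fs.ls against the location reported in Step 6 (the same call, pointed at the _delta_log subdirectory, produces the listing in Step 8):

```python
# In a notebook cell, the returned list of FileInfo objects is rendered as shown below
dbutils.fs.ls("dbfs:/user/hive/warehouse/myschema.db/variant_data_exploration")
```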
[
FileInfo(path='dbfs:/user/hive/warehouse/myschema.db/variant_data_exploration/_delta_log/', name='_delta_log/', size=0, modificationTime=0),
FileInfo(path='dbfs:/user/hive/warehouse/myschema.db/variant_data_exploration/part-00000-603c8a87-dfdd-41a0-817d-9226cef0ab8a-c000.snappy.parquet', name='part-00000-603c8a87-dfdd-41a0-817d-9226cef0ab8a-c000.snappy.parquet', size=3943, modificationTime=1719567376000)
]
As we can see, there is a _delta_log directory (containing the Delta table's metadata/statistics files) and one Parquet file (part-00000-603c8a87-dfdd-41a0-817d-9226cef0ab8a-c000.snappy.parquet) holding a single row.
Step 8: Verify Files in the _delta_log Location
[FileInfo(path='dbfs:/user/hive/warehouse/myschema.db/variant_data_exploration/_delta_log/00000000000000000000.crc', name='00000000000000000000.crc', size=2616, modificationTime=1719567381000),
FileInfo(path='dbfs:/user/hive/warehouse/myschema.db/variant_data_exploration/_delta_log/00000000000000000000.json', name='00000000000000000000.json', size=1741, modificationTime=1719567377000),
FileInfo(path='dbfs:/user/hive/warehouse/myschema.db/variant_data_exploration/_delta_log/_commits/', name='_commits/', size=0, modificationTime=0)]
Mainly, it has two file types:
- 00000000000000000000.json (holds the commit's actions, including per-file column statistics, which drive data pruning/file skipping; for each commit, a new JSON file with an incremented version number is created)
- 00000000000000000000.crc (a per-version checksum/summary file used to detect corruption of the table state; separately, every 10 commits the JSON files in the Delta log are compacted into a Parquet checkpoint file)
Explore
Let's see the content inside 00000000000000000000.json, as this is the main driving factor behind data skipping and query performance.
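Each line of a Delta commit file is a JSON document describing one action, so it can be read directly with the JSON reader; a sketch:

```python
log_path = (
    "dbfs:/user/hive/warehouse/myschema.db/variant_data_exploration"
    "/_delta_log/00000000000000000000.json"
)

# One row per Delta action: commitInfo, metaData, protocol, add, ...
spark.read.json(log_path).show(truncate=False)
```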
The commit records four Delta actions:
- commitInfo: {0628-070916-m4o60ack, Databricks-Runtime/15.3.x-scala2.12, false, WriteSerializable, {398134129842206}, CREATE OR REPLACE TABLE AS SELECT, {1, 3943, 1}, {[], NULL, true, [], {}, false}, {true, false}, 1719567376334, 472f6c8b-cd1d-4347-acd4-c49c2ebd8072, 8684924100662862, abc@gmail.com}
- metaData: {1719567374807, {parquet}, c22760ef-1595-4a59-974a-2a4dbb3a1386, [], {"type":"struct","fields":[{"name":"id","type":"integer","nullable":true,"metadata":{}},{"name":"name","type":"string","nullable":true,"metadata":{}},{"name":"raw","type":"variant","nullable":true,"metadata":{}}]}}
- protocol: {3, 7, [variantType-preview], [variantType-preview]}
- add: {true, 1719567376000, part-00000-603c8a87-dfdd-41a0-817d-9226cef0ab8a-c000.snappy.parquet, 3943, {"numRecords":1,"minValues":{"id":1,"name":"abc"},"maxValues":{"id":1,"name":"abc"},"nullCount":{"id":0,"name":0,"raw":0}}, {1719567376000000, 1719567376000000, 1719567376000000, 268435456}}
Observations
Check this stats section in the Delta log:
{"numRecords":1,"minValues":{"id":1,"name":"abc"},"maxValues":{"id":1,"name":"abc"},"nullCount":{"id":0,"name":0,"raw":0}}
The Delta table gathered min/max statistics only for the id and name columns, not for raw. It means a FILTER condition on a VARIANT column in a SELECT query shouldn't contribute to file skipping; you would still depend on other, non-complex columns for file-level data skipping.
What if we insert a NULL value, then? Will it contribute to data skipping?
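A sketch of the second commit; the id and name values are inferred from the statistics in the commit file shown below:

```python
# Insert a row whose VARIANT column is NULL (this becomes commit/version 1)
spark.sql("""
  INSERT INTO myschema.variant_data_exploration (id, name, raw)
  VALUES (2, 'def', NULL)
""")
```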
Now the Delta log has version 00000000000000000001 available.
Content of 00000000000000000001.json file
This commit records two Delta actions:
- commitInfo: {0628-070916-m4o60ack, Databricks-Runtime/15.3.x-scala2.12, true, WriteSerializable, {398134129842206}, WRITE, {1, 1036, 1}, {Append, [], false}, 0, {true, false}, 1719572454185, 1b7a5e88-9f4b-4c9e-8af3-39a1b808b5cc, 8684924100662862, abc@gmail.com}
- add: {true, 1719572454000, part-00000-477f86d9-19d7-462d-ab1d-e7891348b2a3-c000.snappy.parquet, 1036, {"numRecords":1,"minValues":{"id":2,"name":"def"},"maxValues":{"id":2,"name":"def"},"nullCount":{"id":0,"name":0,"raw":1}}, {1719572454000000, 1719572454000000, 1719572454000000, 268435456}}
See the plan of the query
select * from myschema.variant_data_exploration where raw is not null
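The file counts quoted below come from the scan metrics in the query profile / Spark UI after running the query; the plan itself can be inspected as sketched here:

```python
# EXPLAIN shows the isnotnull(raw) predicate in the plan; the "files read / files pruned"
# counters are visible in the query profile once the query has actually run.
spark.sql("""
  EXPLAIN FORMATTED
  SELECT * FROM myschema.variant_data_exploration WHERE raw IS NOT NULL
""").show(truncate=False)
```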
Here the number of files read = 1 and the number of files pruned = 1, i.e. it skipped the file written by the second commit (the one whose only row has raw = NULL). So the VARIANT column did contribute to file skipping here. Why?
See this section of the 00000000000000000001.json file:
"nullCount":{"id":0,"name":0,"raw":1}}
It means that during the second commit we inserted a row whose raw field is NULL, and Delta Lake captured that in the null-count statistics. Because nullCount equals numRecords for raw in that file, no row in it can satisfy raw IS NOT NULL, so the engine can skip scanning that file.
What if we insert a {} value, then? Will it contribute to data skipping?
In many systems, when persisting "empty" semi-structured data, we store the record as {}, which is a kind of NULL representation in JSON. Let's see how VARIANT responds to it.
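A sketch of the third commit; again, id and name are inferred from the statistics below:

```python
# Insert a row whose VARIANT column holds an empty JSON object rather than NULL (version 2)
spark.sql("""
  INSERT INTO myschema.variant_data_exploration
  SELECT 3 AS id, 'ghi' AS name, parse_json('{}') AS raw
""")
```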
Content of 00000000000000000002.json file
This commit records two Delta actions:
- commitInfo: {0628-070916-m4o60ack, Databricks-Runtime/15.3.x-scala2.12, true, WriteSerializable, {398134129842206}, WRITE, {1, 1102, 1}, {Append, [], false}, 1, {true, false}, 1719573911083, 6a04ec9a-9a61-4875-a47d-7d26d14877cf, 8684924100662862, abc@gmail.com}
- add: {true, 1719573911000, part-00000-4d9aa00d-2c82-4e96-b15c-16cba8b374a4-c000.snappy.parquet, 1102, {"numRecords":1,"minValues":{"id":3,"name":"ghi"},"maxValues":{"id":3,"name":"ghi"},"nullCount":{"id":0,"name":0,"raw":0}}, {1719573911000000, 1719573911000000, 1719573911000000, 268435456}}
Now, if you look at the section below, Delta Lake didn't consider {} a NULL value:
"nullCount":{"id":0,"name":0,"raw":0}}
So, if we run a query like the one below, it will scan two files (from the 1st and the 3rd transactions). See the explain plan below.
select * from myschema.variant_data_exploration where raw:fb::string ='abc';
Conclusion
- VARIANT provides huge flexibility for storing semi-structured data.
- For the VARIANT data type, file pruning is possible only if you store missing data as NULL, not as {} or any other placeholder.
- Because Delta Lake doesn't capture statistics for the internal fields of a VARIANT column, querying those fields results in loading all the Parquet files (within a partition, for a partitioned table) that contain NOT NULL VARIANT data.
- If we are modelling the Delta table and our ETL only ever pushes NOT NULL values into the VARIANT column, you can keep such columns outside the first 32 columns (the default statistics window); performance is expected to be the same. A sketch of an explicit alternative follows this list.
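Rather than relying on column ordering, Delta also lets you declare which columns should get statistics via table properties; the sketch below assumes the delta.dataSkippingStatsColumns property available in recent Databricks runtimes, so verify it against the docs for your runtime.

```python
# Collect data-skipping statistics only for the columns that benefit from them,
# so the VARIANT column's position in the schema no longer matters.
spark.sql("""
  ALTER TABLE myschema.variant_data_exploration
  SET TBLPROPERTIES ('delta.dataSkippingStatsColumns' = 'id,name')
""")
```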
Things to Follow-up
As far as I can remember, a few years back I was debugging a VARIANT column performance issue in Snowflake. At that time, either the Snowflake team or some Snowflake forum (I can't remember exactly) claimed that Snowflake persists VARIANT data in a very flattened format in its storage and gathers statistics for all the internal fields of a VARIANT. That would make Snowflake quite unique, as there would be no performance difference between querying a normal column and an attribute within a VARIANT column. Not sure whether that is still true today. Need to follow up :)