Snowflake: JSON and Variant performance
JSON Variants are flexible, not free.
I love the way that the Snowflake VARIANT data type handles JSON. It's great for variable payloads on event-type data. But you don't get that flexibility without a performance and cost penalty.
Let's look at some examples.
Test Data
For these tests I'll create rows of 150-odd columns by joining multiple tables from the Snowflake TPCDS sample data; the resulting table comprises 10,000,000 rows. Luckily TPCDS has a unique name for every column, so I don't have to invent new ones.
As usual, tests are conducted using an XSMALL warehouse.
create or replace table base as
select top 10000000 *
from sample_data.tpcds_sf10tcl_old.web_sales s
join sample_data.tpcds_sf10tcl_old.date_dim sdt
on sdt.d_date_sk = s.ws_sold_date_sk
join sample_data.tpcds_sf10tcl_old.item i
on i.i_item_sk = s.ws_item_sk
join sample_data.tpcds_sf10tcl_old.customer bc
on bc.c_customer_sk = s.ws_bill_customer_sk
join sample_data.tpcds_sf10tcl_old.customer_address ba
on ba.ca_address_sk = s.ws_bill_addr_sk
join sample_data.tpcds_sf10tcl_old.web_site ws
on ws.web_site_sk = s.ws_web_site_sk
join sample_data.tpcds_sf10tcl_old.web_page wp
on wp.wp_web_page_sk = s.ws_web_page_sk
;
Now I'll create one copy of that table as plain columns, and another with a single VARIANT column holding each row as JSON.
create or replace table ascolumns as
select * from base;
-- 36 seconds
create or replace table asjson as
select object_construct(*) as json
from base;
-- 266 seconds
You'll immediately notice that OBJECT_CONSTRUCT() carries quite an overhead. Creating JSON from tabular data isn't something you want to be doing frequently in production. Part of the cost is assembling the columns into a JSON structure, but part is sub-columnarization: for up to 200 attributes in the variant, Snowflake determines the appropriate data type for each attribute and stores its columnar metadata.
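If only a handful of attributes genuinely need to live in the variant, passing explicit key/value pairs to OBJECT_CONSTRUCT() keeps the object small and the sub-columnarization work light. Here's a minimal sketch against the same base table; I haven't timed it, so treat the benefit as an assumption to verify.
create or replace table asjson_small as
select object_construct(
         'WS_SOLD_DATE_SK', ws_sold_date_sk,  -- only the attributes the pipeline needs
         'WS_NET_PROFIT',   ws_net_profit,
         'D_DATE',          d_date
       ) as json
from base;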
Querying JSON
This metadata means that the performance of sub-columnarized JSON columns is equivalent to that of standard tabular columns. Here's a quick example:
begin
for i in 1 to 100 do
select d_date as dt,
sum(ws_net_profit) as pr
from ascolumns
where ws_sold_date_sk = 2451496
group by all
order by d_date;
end for;
end;
-- 12 seconds
begin
for i in 1 to 100 do
select json:D_DATE as dt,
sum(json:WS_NET_PROFIT) as pr
from asjson
where json:WS_SOLD_DATE_SK = 2451496
group by all
order by dt;
end for;
end;
-- 12 seconds
The reason is that Snowflake has already cast this data to its appropriate type, and maintained sufficient metadata to enable fast retrieval. You can see from the following screenshot that Snowflake has the appropriate data type for each value.
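If you want to check this yourself rather than take the screenshot on faith, TYPEOF() reports the type Snowflake is holding for each extracted value. A quick, untimed sketch:
select typeof(json:D_DATE)          as d_date_type,
       typeof(json:WS_NET_PROFIT)   as ws_net_profit_type,
       typeof(json:WS_SOLD_DATE_SK) as ws_sold_date_sk_type
from asjson
limit 5;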
Copying Data
Copying the table over itself, using INSERT OVERWRITE, demonstrates an interesting difference between the tabular and variant representations.
Remember, the 'asjson' table has already stored the data as a JSON variant.
insert overwrite into ascolumns
select * from ascolumns;
-- 34 seconds
insert overwrite into asjson
select * from asjson;
-- 145 seconds
The reason for this difference is the overhead associated with sub-columnarization. The query goes through an UNPACK/PACK process when it is reading and writing the variant.
From a performance perspective, it is better to carry your data through transformations as tabular columns rather than JSON variants.
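One way to act on that when a variant arrives at the head of a pipeline is to shred it to typed columns once, carry the columns through the intermediate transformations, and only rebuild the JSON at the end if it's needed. A minimal, untimed sketch against the asjson table; the column list and casts are illustrative, not exhaustive.
create or replace table staged_columns as
select json:WS_SOLD_DATE_SK::number    as ws_sold_date_sk,  -- cast each attribute once, up front
       json:WS_NET_PROFIT::number(7,2) as ws_net_profit,
       json:D_DATE::date               as d_date
from asjson;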
Should you use a STRING?
You don't have to keep JSON in a variant. An alternative approach is to keep it in a STRING, and use the PARSE_JSON() function to extract its values.
I can create a test data table similar to the columnar and variant tables.
create or replace table asstring as
select object_construct(*)::string as str
from base;
-- 165 seconds
Copying that table shows that the STRING representation is significantly faster than the JSON variant, whose copy took 145 seconds, although still slower than the tabular version at 34 seconds. (The creation times above follow the same pattern: 165 seconds versus 266 and 36.)
insert overwrite into asstring
select * from asstring;
-- 46 seconds
This technique is only useful if you don't need attributes from the JSON in predicates such as JOIN conditions or WHERE clauses.
select parse_json(str):D_DATE as dt,
sum(parse_json(str):WS_NET_PROFIT) as pr
from asstring
where parse_json(str):WS_SOLD_DATE_SK = 2451496
group by all
order by dt;
-- 100 seconds
The difference in performance is huge because there was no pruning opportunity from within the string, and we had to shred it for every evaluation. Just one iteration of that query took roughly eight times as long as 100 iterations of the tabular or variant versions.
But if you have a long chain of transformations through which your data must pass, and the JSON attributes are not required by those transformations, then this may be a useful technique to improve the performance of your pipeline.
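If some attributes are needed for filtering along the way, a hedged middle ground is to carry those as real columns alongside the string, so only the final step has to parse the JSON. This sketch is untimed, and the table name is mine rather than from the original tests.
create or replace table asstring_keyed as
select ws_sold_date_sk,                      -- filter column kept as a real, prunable column
       object_construct(*)::string as str    -- everything else travels as a string
from base;

select sum(parse_json(str):WS_NET_PROFIT) as pr
from asstring_keyed
where ws_sold_date_sk = 2451496;   -- the predicate never touches the string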
If you use this approach, the time to convert the string back to a variant at the end of the pipeline should be taken into account.
create or replace table fromstring as
select parse_json(str) as json from asstring;
-- 217 seconds
Wrap-Up: When to use JSON
Don't use JSON variants to stuff your whole object model into a row.
Take a cue from the name of the data type: VARIANT. If the attributes are constant from row to row, store them in table columns. If the attributes vary, then JSON can be a good choice.
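In practice that often means a mixed design: fixed attributes as real columns, and only the genuinely variable payload in a VARIANT. The table and column names below are hypothetical, just to illustrate the shape.
create or replace table web_event (
    event_id      number,
    event_ts      timestamp_ntz,
    customer_sk   number,
    event_payload variant   -- only the attributes that differ from event to event
);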
If you want good performance from queries against a variant column, keep the total number of attributes in your JSON to fewer than two hundred. But beware that carrying these columns through multiple transformations has a performance cost.
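A quick way to see where an existing variant column stands against that limit is to count its keys. This only counts top-level keys, so treat it as a rough check rather than the exact figure Snowflake uses:
select max(array_size(object_keys(json))) as max_top_level_keys
from asjson;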
If you have many transformations to complete before the final requirement for JSON, in which the JSON content is not required, consider moving it around as a string and converting it back to JSON at the final step.