Snowflake: JSON and Variant performance

JSON Variants are flexible, not free.

I love the way that the Snowflake VARIANT data type handles JSON. It's great for variable payloads on event-type data. But you don't get that flexibility without a performance and cost penalty.

Let's look at some examples.

Test Data

For these tests I'll build rows of 150-odd columns by joining multiple tables from the Snowflake TPCDS sample data; the resulting table will hold 10,000,000 rows. Conveniently, TPCDS gives every column a unique name, so I don't have to invent new ones.

As usual, tests are conducted using an XSMALL warehouse.

    create or replace table base as
    select  top 10000000 * 
    from    sample_data.tpcds_sf10tcl_old.web_sales s
            join sample_data.tpcds_sf10tcl_old.date_dim sdt 
                on sdt.d_date_sk = s.ws_sold_date_sk
            join sample_data.tpcds_sf10tcl_old.item i 
                on i.i_item_sk = s.ws_item_sk
            join sample_data.tpcds_sf10tcl_old.customer bc 
                on bc.c_customer_sk = s.ws_bill_customer_sk
            join sample_data.tpcds_sf10tcl_old.customer_address ba 
                on ba.ca_address_sk = s.ws_bill_addr_sk
            join sample_data.tpcds_sf10tcl_old.web_site ws 
                on ws.web_site_sk = s.ws_web_site_sk
            join sample_data.tpcds_sf10tcl_old.web_page wp 
                on wp.wp_web_page_sk = s.ws_web_page_sk
            ;        

Now I'll create one copy of that table with ordinary columns, and another with a single VARIANT column holding each row as JSON.

    create or replace table ascolumns as
    select * from base;
    -- 36 seconds

    create or replace table asjson as
    select object_construct(*) as json
    from base;
    -- 266 seconds        

You'll immediately notice that OBJECT_CONSTRUCT() carries quite an overhead: creating JSON from tabular data isn't something you want to be doing frequently in production. Part of the cost is assembling the columns into a JSON structure, but sub-columnarization also plays a role. For up to 200 attributes in the variant, Snowflake determines the appropriate data type for each attribute and stores its columnar metadata.
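
If you're relying on sub-columnarization, it's worth confirming how many top-level attributes your variant actually carries. Here's a quick sketch against the asjson table created above, using OBJECT_KEYS() and ARRAY_SIZE() to count them:

    select  array_size(object_keys(json)) as attribute_count
    from    asjson
    limit   1;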

Querying JSON

This metadata means that the performance of sub-columnarized JSON columns is equivalent to that of standard tabular columns. Here's a quick example:

    begin
    for i in 1 to 100 do
    select  d_date as dt,
            sum(ws_net_profit) as pr
    from    ascolumns
    where   ws_sold_date_sk = 2451496
    group by all
    order by d_date;
    end for;
    end;
    -- 12 seconds

    begin
    for i in 1 to 100 do
    select  json:D_DATE as dt,
            sum(json:WS_NET_PROFIT) as pr
    from    asjson
    where   json:WS_SOLD_DATE_SK = 2451496
    group by all
    order by dt;
    end for;
    end;
    -- 12 seconds        

The reason is that Snowflake has already cast this data to its appropriate type, and maintained sufficient metadata to enable fast retrieval. You can see from the following screenshot that Snowflake has the appropriate data type for each value.

[Screenshot: the variant's sub-columnarized attributes, each shown with its inferred data type]
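
If you want to check those inferred types yourself, TYPEOF() gives a similar view from SQL. A quick sketch against the asjson table, using the same three attributes as the query above:

    select  typeof(json:D_DATE)          as d_date_type,
            typeof(json:WS_NET_PROFIT)   as ws_net_profit_type,
            typeof(json:WS_SOLD_DATE_SK) as ws_sold_date_sk_type
    from    asjson
    limit   1;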

Copying Data

Copying the table over itself, using INSERT OVERWRITE, demonstrates an interesting difference between the tabular and variant representations.

Remember, the 'asjson' table has already stored the data as a JSON variant.

    insert overwrite into ascolumns
    select * from ascolumns;
    -- 34 seconds

    insert overwrite into asjson 
    select * from asjson;
    -- 145 seconds        

The reason for this difference is the overhead associated with sub-columnarization. The query goes through an UNPACK/PACK process when it is reading and writing the variant.

From a performance perspective, it is better to carry your data through transformations as tabular columns rather than JSON variants.
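
If your data arrives as a variant, one option is to shred it into ordinary columns once, at the start of the pipeline, and carry those columns through the remaining transformations. A minimal sketch against the asjson table; the target table name is illustrative, and the casts assume the usual TPCDS types for these columns:

    create or replace table shredded as
    select  json:D_DATE::date               as d_date,
            json:WS_SOLD_DATE_SK::number    as ws_sold_date_sk,
            json:WS_NET_PROFIT::number(7,2) as ws_net_profit
    from    asjson;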

Should you use a STRING?

You don't have to keep JSON in a variant. An alternative approach is to keep it in a STRING, and use the PARSE_JSON() function to extract its values.

I can create a test data table similar to the columnar and variant tables.

    create or replace table asstring as
    select object_construct(*)::string as str 
    from base;
    -- 165 seconds        

Copying that table shows that the STRING version is significantly faster than the JSON variant, whose copy we previously recorded at 145 seconds, but not as fast as the tabular copy at 34 seconds.

    insert overwrite into asstring 
    select * from asstring;
    -- 46 seconds        

This technique is only useful if you don't need attributes from the JSON for predicates, such as join conditions or WHERE clauses.

    select  parse_json(str):D_DATE as dt,
            sum(parse_json(str):WS_NET_PROFIT) as pr
    from    asstring
    where   parse_json(str):WS_SOLD_DATE_SK = 2451496
    group by all
    order by dt;
    -- 100 seconds        

The difference in performance is huge: there is no pruning opportunity inside a string, and it must be shredded for every evaluation. Just one iteration of that query took more than eight times as long as 100 iterations of the tabular or variant version.

But if you have a long chain of transformations through which your data must pass, and the JSON attributes are not required by those transformations, then this may be a useful technique to improve the performance of your pipeline.
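
As a sketch of what such a pipeline step looks like: the stage1 and stage2 tables and the order_key column below are hypothetical, standing in for whatever keys your transformations actually join and filter on; the point is that the JSON string rides along untouched:

    create or replace table stage2 as
    select  s.order_key,    -- ordinary column used by joins and filters
            s.str           -- JSON payload carried through as a string, never parsed
    from    stage1 s
    where   s.order_key is not null;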

If you use this approach, the time to convert it back to JSON at the end of the pipeline should be taken into account.

    create or replace table fromstring as
    select parse_json(str) as json from asstring;
    -- 217 seconds        

Wrap-Up: When to use JSON

Don't use JSON variants to stuff your whole object model into a row.

Take a cue from the name of the data type: VARIANT. If the attributes are constant from row to row, store them in table columns. If the attributes vary, then JSON can be a good choice.
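
A hybrid layout often works well: keep the attributes every row shares as ordinary columns, and reserve the variant for the payload that genuinely varies. The table and column names in this sketch are illustrative:

    create or replace table web_events (
        event_id    number,
        event_ts    timestamp_ntz,
        customer_sk number,
        payload     variant    -- only the attributes that differ from event to event
    );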

If you want good performance from queries against a variant column, keep the total number of attributes in your JSON to fewer than two hundred. But be aware that carrying a variant through multiple transformations has a performance cost.

If your data must pass through many transformations that don't need the JSON content before it is finally required, consider carrying it as a string and converting it back to a variant at the final step.
