Return of the Native: JSON and the RDBMS
The background to this article is Google’s announcement of what it calls a native JSON type in BigQuery. This prompted a comment from Snowflake product lead James Malone in this post. Google's Christopher Crosbie penned a friendly riposte to Malone here.
Both James and Christopher make some good points. I would argue, however, that Chris’ memory is slightly selective, at least with respect to the “giants” he recognizes as having pioneered "native" or in-database JSON. On the terms he describes, Teradata supports JSON as a “native” type; IBM’s approach with DB2 is notably different, but arguably best of all.
The pedantic point is that “native” JSON support is not a recent addition to the RDBMS’ feature set. The important point is that some of the terms being bandied about are imprecise, to say the least. This has the effect of diluting the meaning of these terms and of confusing customers.
In the first place, the use of the word “type” in this context is something of a misnomer. At the very least, the way Google is using this term expands on its technical meaning.
A “type” in a database or a programming language is not at all the same thing as a data interchange format, which is what JSON is.[i] You define, use, and enforce type constraints so that something (a compiler, a database engine, etc.) "knows" what they are, what operations it is permitted to perform on them, and what the results of these operations should look like.
In the second place, all of this quibbling over what makes JSON support in a database “native” is kind of beside the point. In the most useful implementations I have seen, “native” JSON support – be it in the form of a vendor-described “type” or some other scheme – makes it possible for users to perform the gamut of SQL operations on JSON data that is stored in some way in the RDBMS. This is the material point. The larger point is that users need not know (or care) about what this data is or about whether it derives from (or is stored “natively” as) JSON, XML, etc. objects. They can write/generate standard SQL to query JSON data much as they might query any other data that lives in the RDBMS.
Color me confused
All of this invites a question: what kind of “native” JSON support (“type” or otherwise) has Google implemented in BigQuery? For example, BigQuery documentation states that users must query using “JSON-value()” instead of column/field. If so, this is different from how JSON query works in Teradata or DB2. (I cannot comment on JSON support in MySQL, Oracle, or SQL Server because I have not experimented with it on these platforms.)
In Teradata, for example, I’d craft a SQL statement that returns a valid result without regard for whether the column in a database table is JSON or non-JSON:
SELECT * FROM Table.Foo WHERE City = ‘NYC’
In BigQuery, do I have to specify that a column is JSON? That is, must I write some variation of:
SELECT * FROM Table WHERE JSON-VALUE(Table.json_column.city) = “NYC”
Yes? Or no?
The two-fold point is that a “type” in an RDBMS is (a) manipulable via SQL – and that, (b) the RDBMS abstracts most/all of the nitty gritty of performing operations on the typed data.
Recommended by LinkedIn
From the user’s perspective, the details of RDBMS JSON support shouldn’t matter
In a certain sense, then, it doesn't matter how an RDBMS implements support for JSON data, so long as the database’s scheme conforms to the two-fold point I make above.
If I’m a DBMS vendor, I can ingest JSON as a special case of VARCHAR and store it in a new “type” – which I choose to call a “native” JSON type. (Again, the use of scare-quotes is intentional.) But I could also make a parser to store an ASCII CLOB so as to permit field-level access to JSON data. So, for example, shred into bits, store bits; or leave intact, retrieve object, and parse on demand. Again, in a certain sense, it doesn’t matter how my database implements this capability: if I have a database type that I label “JSON,” it might be shredded, or it might be intact; from the SQL querent’s perspective, this is below the waterline and should not matter.
What matters is what users are able to do with in-database JSON. For example:
What are the answers to these questions in BigQuery? In Snowflake? In DB2, MariaDB, MySQL, Oracle, Postgres, Redshift, SQL Server, Teradata, Yellowbrick, etc.?
This is (much) less a case of me pimping this or that DBMS engine than of trying to contextualize what is new/different about what Google, Snowflake, and other vendors are doing. Customers could benefit from more context, more precise usage (and definition) of terms, and, frankly, less marketing-a-ganda. Sadly, the opposite is usually the case.
Another point is that this pedantic quibbling has a distinct marketing function. In the first case, it enables Vendor A to claim that it is doing something new and potentially disruptive; in the second, it permits Vendor A to muddy the waters as to what its competitors are doing.
People need to work and solve problems. Full stop. What Google has done with JSON support in BigQuery is valuable insofar as it makes it possible for BigQuery customers to do new, different, and useful things. This is to be applauded – especially if you’re an existing BigQuery customer.
However, in-database support for JSON is not unique to BigQuery, nor is Google’s implementation of this feature in any sense new, different, or trend-setting. From the perspective of people who grok RDBMS internals, in fact, it may likewise seem to have distinct disadvantages, at least compared to more thoughtful implementations, such as IBM’s with DB2.
[i] In real terms, isn’t JavaScript a type-less interpreted language? Beyond that, a JSON “type” would just mean something mundane, such as an ability to operate on JSON objects, wouldn’t it? That is, not just an ability to store and retrieve JSON objects, but, moreover, an ability actually to do things to or with them?)
[ii] Genuinely curious about this one. Would love for @Christopher Crosbie or other BigQuery folks / users to let me know whether (or how) this works.
Data without analysis is a wasted asset. Analytics without action is wasted effort. I write compulsively and advise startups, established software vendors, investors and enterprises on data, analytics and AI strategy.
3yDude! Obscure.