Experimenting with Databricks Volumes

Volumes in Databricks Unity Catalog were introduced at this year’s Data + AI Summit, expanding data governance beyond tabular datasets.

What are volumes?

Supported in Databricks Runtime 13.2 and above, volumes are Unity Catalog objects, just like tables or views (the catalog will be abbreviated as ‘UC’ for the remainder of this article).

Just like tables:

  • Volumes can be either managed or external. UC external locations can be used to register both external tables and external volumes. Managed volumes are created without specifying a location, as the access path is managed by UC.
  • They support SQL commands, dbutils, Spark APIs, pandas, etc.
  • They work with Auto Loader and COPY INTO (see the sketch after this list).
  • They reside in the third layer of UC’s 3-level namespace (<catalog>.<schema>.<volume>), which means they benefit from the UC security model, with no need for the ‘old’ mount-point setups with service principals.
  • They use the same basic UC privilege model, but for working with files (CREATE, READ, WRITE, DROP, DESCRIBE, SHOW, ALTER).
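
As a quick sketch of the Auto Loader point above, here is what incremental CSV ingestion from a volume could look like in a Python notebook; the catalog, schema, volume, and table names below are all illustrative:

    # Incrementally ingest CSV files landing in a volume with Auto Loader.
    # All names and paths below are illustrative.
    (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("cloudFiles.schemaLocation", "/Volumes/a_catalog/a_schema/a_volume/_schemas")
        .load("/Volumes/a_catalog/a_schema/a_volume/landing/")
        .writeStream
        .option("checkpointLocation", "/Volumes/a_catalog/a_schema/a_volume/_checkpoints")
        .trigger(availableNow=True)
        .toTable("a_catalog.a_schema.ingested_csv"))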


[Image: Volumes - Overview]

So what sets volumes apart from plain tables? Volumes — whether managed or external — can store non-tabular datasets, meaning files in any format, whether structured, semi-structured or unstructured: CSV files, images, audio, video, PDF, XML, etc. Basically, they represent a logical volume of data in arbitrary formats in cloud storage, referenced within the Databricks platform. As a side note, the ‘files in arbitrary formats’ bit is the announced objective; as we’ll see later on, we are not quite there yet.

The next logical question would be: what’s the difference between volumes and storage mounts? We already have a way to load those same types of data from cloud storage using a service principal, so what’s the added value here? What’s new is that volumes are integrated into UC’s data management (security model, lineage tracking, etc.), seamlessly extending UC’s governance from tabular datasets to also cover a wide array of non-tabular data.

Volumes in practice

We’ll be using Azure Databricks for this little experiment (but this applies of course to both GCP and AWS). Although we can create volumes using the UC Data Explorer (here’s the step-by-step guide on how to proceed), we’ll do it with code, using a Databricks Runtime 13.2 cluster, starting with a managed volume:

[Image: Managed volume]
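
A minimal sketch of that cell from a Python notebook; the a_catalog, a_schema and a_volume names are illustrative and reused throughout:

    # Create a managed volume; its storage location is managed by UC.
    spark.sql("CREATE VOLUME IF NOT EXISTS a_catalog.a_schema.a_volume")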

To be a tad more specific on CREATE VOLUME permissions, the USE CATALOG and USE SCHEMA privileges are also required as preconditions. Other volume commands have similar chains of hierarchical permissions.

Keep in mind that, since we are within the UC realm, we can set up permissions on that data using the Data Explorer, or in code, e.g.:

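For instance, a sketch granting a group read and write access on the volume (the data_engineers group name is illustrative):

    # Grant read and write access on the volume to a group
    # (the group name is illustrative).
    spark.sql("""
        GRANT READ VOLUME, WRITE VOLUME
        ON VOLUME a_catalog.a_schema.a_volume
        TO `data_engineers`
    """)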

We could of course choose to be more granular when granting permissions:

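For example, read-only access for a single user (the user name is illustrative):

    # A more granular grant: read-only access for one user.
    spark.sql("""
        GRANT READ VOLUME
        ON VOLUME a_catalog.a_schema.a_volume
        TO `some.user@example.com`
    """)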

Revoke previously granted privileges:

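Along the same lines, again with illustrative names:

    # Revoke the previously granted privilege.
    spark.sql("""
        REVOKE READ VOLUME
        ON VOLUME a_catalog.a_schema.a_volume
        FROM `some.user@example.com`
    """)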

If we now switch over to the UC root location in the data lake, we can see that our non-tabular and tabular data are segregated into their own separate folders.

Just like managed tables, managed volumes provide a convenient place for content without the need to manage external locations and storage credentials. Access to data is through paths managed by UC. Bear in mind that if a volume is dropped (e.g., with DROP VOLUME IF EXISTS a_catalog.a_schema.a_volume), the underlying data will also be deleted. Note that dropping a volume requires owner status on that volume.

Up to this point, we mostly focused on permissions. Next, let’s see how we access files in volumes and track lineage.

Assuming that we previously created an external location with UC Data Explorer, we can create an external volume this way:

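A sketch of that cell; the volume name and the storage URL are illustrative, and the URL must match a pre-registered UC external location:

    # Create an external volume on top of a registered external location.
    # The storage URL below is illustrative.
    spark.sql("""
        CREATE EXTERNAL VOLUME a_catalog.a_schema.an_external_volume
        LOCATION 'abfss://landing@mystorageaccount.dfs.core.windows.net/raw'
    """)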

We uploaded both a CSV and a PDF file to that location. Let’s read the (fictitious) employee CSV file and save it as a UC table under the same schema:

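A sketch of that read/write, with illustrative file and table names:

    # Read the CSV through the volume path and save it as a UC table.
    df = (spark.read
          .format("csv")
          .option("header", True)
          .load("/Volumes/a_catalog/a_schema/an_external_volume/employees.csv"))
    df.write.saveAsTable("a_catalog.a_schema.employees")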

Notice the access path ‘/Volumes/<catalog>/<schema>/<volume-name>/' when accessing a volume, whether managed or external. This is essentially a FUSE mount that allows us to access those files as if they were in a local filesystem, meaning we could also use shell (%sh) or dbutils.fs commands to interact with volumes.
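
For example (paths are illustrative):

    # The same volume, seen through dbutils.fs and the local filesystem.
    display(dbutils.fs.ls("/Volumes/a_catalog/a_schema/an_external_volume/"))

    import os
    os.listdir("/Volumes/a_catalog/a_schema/an_external_volume/")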

Switching to the Data Explorer, we can verify that table lineage tracking now also includes volumes: the volume appears in the new table’s lineage, and clicking on the lineage graph button displays it as an upstream source of the table.

Supported data sources

At this time, UC’s external locations only support the following file formats: CSV, TSV, Avro, Parquet, and JSON. Even though it is still possible to process the PDF file the same way (i.e., using the same volume path and a library like tabula-py, then saving the result as a UC table), the volume will not get picked up by lineage, and the graph will only display the new table. No errors will be displayed, but we won’t get the lineage information.

Code:

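A sketch of that cell, assuming tabula-py (which also needs Java) is installed on the cluster; file and table names are illustrative:

    import tabula  # tabula-py, installed on the cluster

    # Extract tables from the PDF through the volume's FUSE path.
    pdf_tables = tabula.read_pdf(
        "/Volumes/a_catalog/a_schema/an_external_volume/report.pdf",
        pages="all",
    )

    # Save the first extracted table as a UC table.
    spark.createDataFrame(pdf_tables[0]).write.saveAsTable("a_catalog.a_schema.pdf_table")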

In the Data Explorer, the lineage graph indeed only displays the new table, with no link back to the volume.

This limitation also applies to managed volumes, and we can verify that by copying the PDF to the managed volume, and then retrying the read/write operation using the managed volume access path instead:

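A sketch of that check, reusing the illustrative names from above:

    # Copy the PDF from the external volume to the managed volume.
    dbutils.fs.cp(
        "/Volumes/a_catalog/a_schema/an_external_volume/report.pdf",
        "/Volumes/a_catalog/a_schema/a_volume/report.pdf",
    )

    # Retry the read/write through the managed volume's access path.
    pdf_tables = tabula.read_pdf(
        "/Volumes/a_catalog/a_schema/a_volume/report.pdf",
        pages="all",
    )
    spark.createDataFrame(pdf_tables[0]).write.saveAsTable("a_catalog.a_schema.pdf_table_managed")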

Still, no lineage information is displayed for the managed volume when using a PDF as the data source.

In summary, it appears that, for now at least, volumes behave as intended, provided the source data format is supported by UC. We would of course expect that list of supported formats to expand in the near future.

Final thoughts

Let’s wrap things up by listing all the volumes in that catalog and schema:

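For instance, with the illustrative names used so far:

    # List the volumes registered under the catalog and schema.
    display(spark.sql("SHOW VOLUMES IN a_catalog.a_schema"))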

As of this writing, volumes are in the Public Preview phase and come with a few limitations. Still, this would be a good time to start experimenting with them. Both managed and external volumes can be used to access and manage files of non-tabular, supported formats. External volumes, for instance, can play an important role in efficiently managing large collections of non-tabular datasets generated by external systems and accessed by Databricks, or vice versa. UC can then seamlessly extend its governance model to raw datasets in landing areas with zero copy, and ensure more comprehensive coverage of data pipelines.
