Experimenting with Databricks Volumes
Volumes in Databricks Unity Catalog were introduced at this year’s Data + AI Summit, expanding data governance beyond tabular datasets.
What are volumes?
Supported in Databricks Runtime 13.2 and above, volumes are Unity Catalog objects, just like tables or views (Unity Catalog will be abbreviated as ‘UC’ for the remainder of this article).
Just like tables, volumes live in UC’s three-level namespace and are addressed as <catalog>.<schema>.<volume>.
So what sets volumes apart from plain tables? Volumes — whether managed or external — can store non-tabular datasets, meaning files in any format, whether structured, semi-structured or unstructured: CSV files, images, audio, video, PDF, XML, etc. Basically, they represent a logical volume of data in arbitrary formats in cloud storage, referenced within the Databricks platform. As a side note, the ‘files in arbitrary formats’ bit is the announced objective. As we’ll see later on, we are not quite there yet.
The next logical question would be: what’s the difference between volumes and storage mounts? We already have a way to load those same types of data from cloud storage using a service principal, so what’s the added value here? What’s new is that volumes are integrated into UC’s data management (security model, lineage tracking, etc.), and seamlessly extend UC’s governance beyond tabular datasets to cover a wide array of non-tabular data as well.
Volumes in practice
We’ll be using Azure Databricks for this little experiment (but this applies, of course, to GCP and AWS as well). Although we can create volumes using the UC Data Explorer (here’s the step-by-step guide on how to proceed), we’ll do it with code, using a Databricks Runtime 13.2 cluster, starting with a managed volume:
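A minimal sketch of what that cell might look like — the catalog, schema and volume names are illustrative (they match the a_catalog.a_schema.a_volume names used in the DROP VOLUME example later on):

```sql
-- Assumes the catalog and schema already exist, and that we hold
-- USE CATALOG / USE SCHEMA / CREATE VOLUME privileges on them
USE CATALOG a_catalog;
USE SCHEMA a_schema;

CREATE VOLUME IF NOT EXISTS a_volume
COMMENT 'A managed volume for non-tabular data';
```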
To be a tad more specific on CREATE VOLUME permissions, USE CATALOG and USE SCHEMA privileges are also required as pre-conditions. Other volume commands have similar chains of hierarchical permissions.
Keep in mind that, since we are within the UC realm, we can setup permissions on that data using the Data Explorer, or in code, e.g.:
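For instance, a broad grant on the volume could look like this (the group name is illustrative):

```sql
-- Grant every volume privilege to a group in one statement
GRANT ALL PRIVILEGES ON VOLUME a_catalog.a_schema.a_volume TO `data-engineers`;
```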
We could of course choose to be more granular when granting permissions:
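A sketch of more granular grants, assuming the same illustrative group names — READ VOLUME allows reading files from the volume, while WRITE VOLUME also allows adding, overwriting and deleting files:

```sql
GRANT READ VOLUME  ON VOLUME a_catalog.a_schema.a_volume TO `data-analysts`;
GRANT WRITE VOLUME ON VOLUME a_catalog.a_schema.a_volume TO `data-engineers`;
```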
Revoke previously granted privileges:
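Revocation mirrors the grant syntax:

```sql
REVOKE WRITE VOLUME ON VOLUME a_catalog.a_schema.a_volume FROM `data-engineers`;
```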
If we now switch over to the UC root location in the data lake:
We can see that our non-tabular and tabular data are segregated in their own separate folders.
Just like managed tables, managed volumes provide a convenient place for content without the need to manage external locations & storage credentials. Access to data is through paths managed by UC. Bear in mind that if a volume is dropped (e.g., with DROP VOLUME IF EXISTS a_catalog.a_schema.a_volume), the underlying data will also be deleted. Note that dropping a volume requires owner status on that volume.
Up to this point, we mostly focused on permissions. Next, let’s see how we access files in volumes and track lineage.
Assuming that we previously created an external location with UC Data Explorer, we can create an external volume this way:
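A sketch of the statement, assuming the external location covers the storage path below (the storage account, container and volume names are illustrative):

```sql
-- LOCATION must fall under a UC external location we are allowed to use
CREATE EXTERNAL VOLUME a_catalog.a_schema.an_external_volume
LOCATION 'abfss://landing@mystorageaccount.dfs.core.windows.net/raw'
COMMENT 'An external volume over an Azure external location';
```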
We uploaded both a CSV and a PDF file in that location. Let’s read the (fictitious) employee CSV file and save it as a UC table under the same schema:
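A sketch of that read/write step — the file name, table name and schema options are illustrative; the key point is the /Volumes/ access path:

```python
# Read the CSV from the external volume via its /Volumes/ path...
df = (spark.read
      .format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("/Volumes/a_catalog/a_schema/an_external_volume/employees.csv"))

# ...and persist it as a managed UC table under the same schema
df.write.saveAsTable("a_catalog.a_schema.employees")
```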
Notice the access path ‘/Volumes/<catalog>/<schema>/<volume-name>/' when accessing a volume, whether managed or external. This is essentially a FUSE mount that allows us to access those files as if they were in a local filesystem, meaning we could also use shell (%sh) or dbutils.fs commands to interact with volumes.
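For example (volume names are illustrative, as before):

```python
# dbutils.fs understands the same FUSE path
display(dbutils.fs.ls("/Volumes/a_catalog/a_schema/an_external_volume/"))

# The identical path works from a shell cell:
# %sh ls -la /Volumes/a_catalog/a_schema/an_external_volume/
```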
Switching to the Data Explorer, we can verify that table lineage tracking now also includes volumes:
Clicking on the lineage graph button:
Supported data sources
At this time, UC’s external locations only support the following file formats: CSV, TSV, Avro, Parquet, and JSON. Even though it is still possible to process the PDF file the same way (i.e., read it via the same volume path with a library like tabula-py and save the result as a UC table), the volume will not get picked up by lineage, and the graph will only display the new table. No errors will be displayed, but we won’t get the lineage information.
Code:
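A sketch of that PDF step, assuming tabula-py is installed on the cluster (the file and table names are illustrative):

```python
import tabula  # tabula-py: extracts tables from PDFs as pandas DataFrames

pdf_path = "/Volumes/a_catalog/a_schema/an_external_volume/report.pdf"
tables = tabula.read_pdf(pdf_path, pages="all")  # returns a list of DataFrames

# Persist the first extracted table as a UC table — this works fine,
# but no lineage will be recorded for the volume
spark.createDataFrame(tables[0]).write.saveAsTable("a_catalog.a_schema.report_data")
```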
Data Explorer:
This limitation also applies to managed volumes, and we can verify that by copying the PDF to the managed volume, and then retrying the read/write operation using the managed volume access path instead:
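A sketch of the copy (paths are illustrative), after which the read/write step above can be re-run against the managed volume’s path:

```python
# Copy the PDF from the external volume into the managed volume via FUSE paths
dbutils.fs.cp(
    "/Volumes/a_catalog/a_schema/an_external_volume/report.pdf",
    "/Volumes/a_catalog/a_schema/a_volume/report.pdf",
)
```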
Still, no lineage information is displayed for the managed volume when using a PDF as the data source:
In summary, it appears that, for now at least, volumes will behave as intended…provided the source data format is supported by UC. We would of course expect that list of supported formats to expand in the near future.
Final thoughts
Let’s wrap things up by listing all the volumes in that catalog and schema:
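```sql
SHOW VOLUMES IN a_catalog.a_schema;
```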
As of this writing, volumes are in the Public Preview phase and come with a few limitations. Still, this would be a good time to start experimenting with them. Both managed and external volumes can be used to access and manage files of non-tabular, supported formats. External volumes, for instance, can play an important role in efficiently managing large collections of non-tabular datasets generated by external systems and accessed by Databricks, or vice versa. UC can then seamlessly extend its governance model to raw datasets in landing areas with zero-copy, and ensure a more comprehensive coverage of data pipelines.