Imagine efficiency: right-click a file and instantly start analyzing it with SQL in Synapse
In the last post we dreamed a little about the ifs and whens of a modern analytical data estate. My Big Mac in this area is Azure Synapse Analytics. Even if I were not a Microsoft employee and just looked at how nicely this suite of analytical tools is integrated, I would become more and more of a fan.
Let's examine, for example, the tight integration between the storage component and SQL, the most widely used "data language". Ok, you want Python too? Let me see...
When you provision a Synapse Workspace in your Azure subscription, a Data Lake Storage account is always attached to it by default. You can use an existing one or create a new one during the provisioning sequence.
And don't get it wrong: a Data Lake in this context is "just" the storage component that will hold all the files you want to analyze. You can store any file format there, in any volume. If you want to bring petabytes, please go ahead. If you want to analyze CSVs, images or sound, go ahead.
The Data Lake Storage will even implement a first line of defence for you. It integrates with your Azure Active Directory, and therefore with your on-premises Active Directory if you federate it. But it's not only about access rights: the Data Lake Storage also integrates with virtual networks and delivers a firewall where you can narrow traffic down to selected vNets and IPs. So you can completely hide it from the rest of the world and make it available only from within your own networks.
Once you're done provisioning and securing your environment, you can open your Synapse Studio. This is one of the interfaces you can use to interact with your analytical data estate. There are others too: Visual Studio and Visual Studio Code, for example, or SQL Server Management Studio (SSMS), and Azure Data Studio as well. But for this time, let's use my favourite one and start the analysis.
I have teased using SQL against the data in your Data Lake, so let's see how we get there.
In Synapse Studio you have several areas called hubs on the leftmost side of your window. There is the Data hub, where you interact with different storage components; this is the one we will use for the upcoming SQL action. Additionally you will find the Develop, Integrate, Monitor and Manage hubs for the different tasks you can perform in your environment.
When you browse to your Data hub, you'll find two sections right next to the hubs: Workspace and Linked. For now, let's navigate to the Linked section and check the nodes presented there. You will see the default Data Lake Storage that you attached to your Synapse environment, and you'll find a node for Integration datasets. We will cover that one in another post.
Let's browse the Data Lake. And yes, it's that easy: right in your browser, Synapse Studio gives you a nice explorer experience with all the features you'd expect, like creating folders, controlling access (read, write and execute permissions for folders and files, assigned to AAD users and groups) and more.
And if you take a close look at the ribbon of the file browser area, you can already guess what we are about to do here.
In my tmp folder I have created another folder called sql_against_datalake and browsed there. Sure, it's empty now, but with the Upload button I can easily bring a file from my local drive into that folder.
Now this is where the fun starts. For the file formats .csv and .parquet you can instantly start analyzing the contents of a file just by right-clicking it and selecting New SQL script > Select TOP 100 rows.
You are taken to a SQL query window and presented with a pre-formulated SQL script that selects the top 100 rows of your file using the OPENROWSET() function of the Serverless SQL Pool of your Synapse Analytics Workspace.
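For a CSV file, the generated script looks roughly like this; the storage account, container and file name in angle brackets are placeholders, yours will reflect your own Data Lake:

```sql
-- Shape of the auto-generated "Select TOP 100 rows" script
-- (<storage-account>, <container> and <your-file> are placeholders)
SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://<storage-account>.dfs.core.windows.net/<container>/tmp/sql_against_datalake/<your-file>.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0'
    ) AS [result]
```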
Now let's be bold, hit > Run and see what happens.
Not the optimal output yet. OK, let's adjust the SQL statement a little to reflect the actual configuration of the file and hit > Run again.
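As a sketch, the options you'll reach for most often with CSV files are HEADER_ROW and FIELDTERMINATOR; the concrete values below are assumptions about the file, adjust them to match yours:

```sql
-- Adjusted to match the file's layout (header row and delimiter are assumptions)
SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://<storage-account>.dfs.core.windows.net/<container>/tmp/sql_against_datalake/<your-file>.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        HEADER_ROW = TRUE,       -- first row contains column names
        FIELDTERMINATOR = ','    -- set to ';' or '\t' if your file uses those
    ) AS [result]
```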
So there we are already: we have selected and displayed the content of our .csv file using SQL. And the nice thing about it is that the Synapse Workspace supported us in getting there.
From here you can take it further: for example, adjust your SQL query to display the sum of the amounts per product, and then switch the results view to Chart.
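A minimal sketch of such an aggregation, assuming a sales file with product and amount columns (the file name and column names are my assumptions):

```sql
-- Aggregate sales per product ('product' and 'amount' column names are assumptions)
SELECT
    [sales].product,
    SUM([sales].amount) AS total_amount
FROM
    OPENROWSET(
        BULK 'https://<storage-account>.dfs.core.windows.net/<container>/tmp/sql_against_datalake/<sales-file>.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        HEADER_ROW = TRUE
    ) AS [sales]
GROUP BY [sales].product
ORDER BY total_amount DESC
```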
Not sure what you think about it, but I like it. And it's not only about the ad-hoc experience: you now have the option to create views over your Data Lake files, for example, and make them available to other tools that "speak" SQL. Think of your reporting tool, like Power BI, connecting to the serverless SQL endpoint of Synapse to access all the views you create here. Or maybe you want to analyze a bigger amount of data in your files, and maybe you want to join in other data that you have available in files too?
Let's go ahead and try that. I'll add a file to my folder that offers me some product master data, using the Upload button in the Data Lake storage explorer again, like above. The file name is product_name_price.csv, and that is also pretty much its content: product names and prices, plus some additional information like a category/sub-category hierarchy.
So again I'll browse my Data Lake to the folder, right-click the file and create a new query for it. This time there is no header row in the file, but I can name the columns in my SQL based on their ordinal position and even influence their data types right in the query:
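Roughly like this; the column names and types in the WITH clause are my assumptions about the file's layout:

```sql
-- No header row: map columns by ordinal position and assign names and types
-- (the names and types below are assumptions about the file's layout)
SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://<storage-account>.dfs.core.windows.net/<container>/tmp/sql_against_datalake/product_name_price.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0'
    )
    WITH (
        product_name VARCHAR(100) 1,   -- 1st column in the file
        price        DECIMAL(10,2) 2,  -- 2nd column
        category     VARCHAR(50) 3,
        sub_category VARCHAR(50) 4
    ) AS [products]
```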
One of the purposes named in the documentation of serverless SQL pools is the virtual Data Warehouse: the option to create a virtual DWH layer over your Data Lake and other storages that you can attach to your Synapse Workspace.
You can do this by creating views on top of your files, for example. Let's look at how this is done with my query from above. I'll add a CREATE VIEW statement before the SELECT, take away the TOP 100, and off we go.
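Sketched out, it looks like this. Note that views cannot be created in the master database of a serverless SQL pool, so create and switch to a database of your own first; the database and view names below are mine:

```sql
-- One-time setup: views can't live in master, so use your own serverless database
CREATE DATABASE datalake_views;
GO
USE datalake_views;
GO

CREATE VIEW dbo.vw_products
AS
SELECT
    *
FROM
    OPENROWSET(
        BULK 'https://<storage-account>.dfs.core.windows.net/<container>/tmp/sql_against_datalake/product_name_price.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0'
    )
    WITH (
        product_name VARCHAR(100) 1,
        price        DECIMAL(10,2) 2,
        category     VARCHAR(50) 3,
        sub_category VARCHAR(50) 4
    ) AS [products];
GO
```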
Just as easy as that, and now I have a view that will even update as soon as the file it is based on is altered.
But let's now use that view in the query with the product sales:
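A sketch of that join; joining on matching product names between the two files is my assumption, as are the sales file's name and columns:

```sql
-- Join the raw sales file with the product view (join key is an assumption)
SELECT
    p.category,
    p.product_name,
    SUM(s.amount) AS total_amount
FROM
    OPENROWSET(
        BULK 'https://<storage-account>.dfs.core.windows.net/<container>/tmp/sql_against_datalake/<sales-file>.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        HEADER_ROW = TRUE
    ) AS s
    INNER JOIN dbo.vw_products AS p
        ON p.product_name = s.product
GROUP BY
    p.category,
    p.product_name
ORDER BY total_amount DESC;
```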
and again instantly display the results as a chart in the results view:
Really tempting, isn't it? I remember many projects where this environment would have helped a lot.
What is your opinion about it?
Coming up:
Now, maybe you'll face some hiccups when you try the steps from this article for the first time. This is due to the rigid security settings of your Data Lake and the Synapse Workspace. Let's examine that together in the next article.
Other parts of this series: