Azure Data Engineering Tips & Tricks: Synapse Analytics Collation and CSV files
In this article, I walk through the concept of collation in the Azure Synapse Analytics serverless SQL pool and how to work around the limitation of the master database's default collation when querying CSV files.
One of the main compute engines in Azure Synapse Analytics is the serverless SQL pool, which lets us query external files directly in object storage (such as files stored in Azure Blob Storage or ADLS Gen2). It is a pay-as-you-go service that charges based on the amount of data processed by your queries.
When you initially start querying your files in a serverless SQL pool, you're essentially using the master database.
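If you want to verify this, the standard DB_NAME() function (available in the serverless SQL pool) shows which database an ad-hoc query runs in; it should return 'master' unless you have created and selected another database:
SELECT DB_NAME() AS current_database; -- expected to return 'master' for ad-hoc queries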
The Problem:
While the serverless engine offers fast and convenient access for exploring data files without defining table schemas or going through traditional ETL processes, it has some shortcomings. One of them is that you cannot change the default collation of the master database, a limitation you'll encounter when working with CSV files encoded in UTF-8. I'll explain this limitation with an example and then discuss how to address it.
What is Collation?
First, let's understand what collation means in a database. As the Microsoft documentation explains:
Collations in SQL Server provide sorting rules, case, and accent sensitivity properties for your data.
What does that mean? What are accent sensitivity properties? In general, collation defines the rules a database uses for comparing and sorting characters. For example, when you want to find rows containing "cafe", should the database engine also return rows that contain "CAFE", "Cafe", "cAFe", or even "Café"? These rules are set by the default collation of the database.
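To make this concrete, here is a small illustrative comparison (the literal values are just examples, not taken from our dataset): with a case- and accent-insensitive collation the two strings compare as equal, while with a case- and accent-sensitive collation they don't.
SELECT
CASE WHEN 'cafe' = 'Café' COLLATE Latin1_General_100_CI_AI
THEN 'match' ELSE 'no match' END AS case_accent_insensitive, -- returns 'match'
CASE WHEN 'cafe' = 'Café' COLLATE Latin1_General_100_CS_AS
THEN 'match' ELSE 'no match' END AS case_accent_sensitive; -- returns 'no match'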
Example
Let's go through an example and see what happens.
First, you need to create a Synapse Analytics workspace and its corresponding ADLS Gen2 storage account. Then, you can upload one of the CSV files of this dataset into your ADLS Gen2 account.
We start by querying the top 100 rows of the file:
SELECT
TOP 100 *
FROM
OPENROWSET(
BULK 'https://your_adls2_account_name.dfs.core.windows.net/path_to_csv_file.csv',
FORMAT = 'CSV',
PARSER_VERSION = '2.0',
HEADER_ROW = TRUE
) AS [result]
Everything seems to be fine:
But there might be an issue in the Messages pane of the query results:
For all the string columns, we received this error:
Potential conversion error while reading VARCHAR column '....' from UTF8 encoded text. Change database collation to a UTF8 collation or specify explicit column schema in WITH clause and assign UTF8 collation to VARCHAR columns.
Why does this happen?
If we check the default collation of the master database, we see that it is SQL_Latin1_General_CP1_CI_AS:
SELECT name, collation_name
FROM sys.databases;
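If you only care about the master database, a more targeted check with the standard DATABASEPROPERTYEX function returns just its collation:
SELECT DATABASEPROPERTYEX('master', 'Collation') AS master_collation;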
Keep in mind that this collation is not compatible with UTF8-encoded files (which is the default encoding for most CSV files).
Now, let’s examine this further and look at how OPENROWSET has inferred the schema of the file. We use a well-known stored procedure in the serverless SQL pool, sp_describe_first_result_set, which returns the data types inferred by the OPENROWSET function:
EXEC sp_describe_first_result_set N'SELECT
TOP 100 *
FROM
OPENROWSET(
BULK ''https://your_adls2_account_name.dfs.core.windows.net/path_to_csv_file.csv'',
FORMAT = ''CSV'',
PARSER_VERSION = ''2.0'',
HEADER_ROW = TRUE
) AS [result]'
As you see, for the string-based columns, it has inferred a very large length of 8000 characters, which is a lot for storing our string values. Let’s check the maximum length actually required for these columns. Do they really need to be that long?
SELECT
MAX(LEN(order_id)) AS order_id_max_length,
MAX(LEN(customer_id)) AS customer_id_max_length,
MAX(LEN(order_status)) AS order_status_max_length
FROM
OPENROWSET(
BULK 'https://your_adls2_account_name.dfs.core.windows.net/path_to_csv_file.csv',
FORMAT = 'CSV',
PARSER_VERSION = '2.0',
HEADER_ROW = TRUE
) AS [result]
The results show:
The simple answer to our question is: No.
As you see, the maximum lengths of order_id, customer_id, and order_status are 32, 32, and 11 respectively, while the OPENROWSET function assigned each of them a length of 8000 characters, which is much longer than needed.
Why change the inferred and initial data types?
Oversized string columns force the engine to process more data than necessary, which hurts query performance. As a result, it’s recommended to use the smallest data types that can accommodate your values.
How do we address this limitation?
We need to declare the column schema and collation explicitly in the OPENROWSET query's WITH clause. Let's look at the optimized query. First, we use a VARCHAR length slightly larger than the maximum length of the values in each column. Additionally, since we are working with a CSV file encoded in UTF-8, we declare a collation that matches that encoding:
SELECT
*
FROM
OPENROWSET(
BULK 'https://your_adls2_account_name.dfs.core.windows.net/path_to_csv_file.csv',
FORMAT = 'CSV',
PARSER_VERSION = '2.0',
HEADER_ROW = TRUE
)
WITH (
order_id VARCHAR(50) COLLATE Latin1_General_100_CI_AI_SC_UTF8,
customer_id VARCHAR(50) COLLATE Latin1_General_100_CI_AI_SC_UTF8,
order_status VARCHAR(15) COLLATE Latin1_General_100_CI_AI_SC_UTF8,
order_purchase_timestamp DATETIME2,
order_approved_at DATETIME2,
order_delivered_carrier_date DATETIME2,
order_delivered_customer_date DATETIME2,
order_estimated_delivery_date DATETIME2
) AS [RESULT]
In our query, specifying the collation for those three columns changes how Synapse interprets the characters in them. Because our CSV file is encoded in UTF-8 while the default collation of the Synapse master database is SQL_Latin1_General_CP1_CI_AS, we got the potential conversion error message and the very large inferred lengths for those columns. In effect, Synapse was trying to map the UTF-8 text of the source file into its own collation, and to ensure all the strings could be converted, it assigned them a very long data type.
There is also another solution: create a new database and change its default collation to "Latin1_General_100_CI_AI_SC_UTF8" or another collation compatible with UTF-8 encoding. I won't discuss that solution in detail in this article.
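That said, for reference, here is a minimal sketch of that approach, assuming a hypothetical database name utf8_demo; string columns inferred by OPENROWSET in such a database inherit its UTF-8-compatible default collation:
-- Create a user database with a UTF-8 compatible default collation (database name is illustrative)
CREATE DATABASE utf8_demo COLLATE Latin1_General_100_CI_AI_SC_UTF8;
-- For an existing user database, the default collation can also be changed:
-- ALTER DATABASE utf8_demo COLLATE Latin1_General_100_CI_AI_SC_UTF8;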
Results:
After running the same query, we see:
while it was previously:
Now that we’ve specified a UTF-8 collation consistent with our CSV file’s encoding, rerunning the SELECT statement no longer results in any “potential conversion error…” messages.
We see a decrease in the execution time for the optimized query compared to the initial one. Notice that in this case we are dealing with only a few thousand records and a single CSV file. When working with millions of records, these minor changes result in major performance gains.