Data Catalog for Snowflake
You have a data platform on Snowflake. And you are looking for a data catalog. Something that stores all the metadata (easily searchable, with AI). Something that crawls your Snowflake databases every day, cataloging new tables along with their lineage from dbt or ADF. Perhaps something like Collibra or Atlan or Alation. Perhaps Glue or Purview or Informatica. Or data.world. Surely Snowflake has a data catalog, like Unity Catalog in Databricks?
Does Snowflake have a data catalog?
I’ll cut to the chase. Snowflake does not have a data catalog. What Snowflake calls Horizon Catalog (link, link) is neither a product nor a piece of software. It is a framework that uses existing Snowflake features to manage data (link). For example: masking policy, tagging, data lineage, Data Metric Function, Universal Search and Access History.
What Snowflake calls Open Catalog is a data catalog product (not vapourware), which physically consists of Iceberg tables. Those Iceberg tables can be in AWS, Azure or GCP i.e. in S3, OneLake or GCS. You can use Snowflake SQL, Python, Spark, Trino or Flink to update and query those Iceberg tables.
Now, with Snowflake Horizon Catalog and Snowflake Open Catalog out of the way, let’s get into the real thing: Data Catalog for Snowflake.
Modern data catalog
You have a modern data platform (Snowflake) so naturally you want a modern data catalog. Your entire data governance program relies on it. But what kind of features should a modern data catalog have? There are 3 key features that a modern data catalog should have:
1. Search
You need to be able to find data easily. You can tag the data (tables, columns, files, dashboard elements, etc.) and search based on those tags. All sorts of tags, from GDPR, data age, sensitive / internal / confidential / restricted – basically all kinds of data classifications. And yes you want to search using AI, as well as SQL (AI for the business users, and SQL for the data analysts). You should be able to search the metadata, the dictionary and the data itself. And yes, the data profile and data quality too.
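To make the tag-based search concrete, here is a minimal sketch in Python. It is not how any particular catalog product works; the `CatalogEntry` class, the `search` function and the sample entries are all hypothetical names I made up for illustration. It only shows the idea: every asset carries a set of tags, and a search returns the assets that carry all the requested tags and match the free text.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One catalogued asset (table, column, file, dashboard element)."""
    name: str
    description: str = ""
    tags: set = field(default_factory=set)


def search(entries, required_tags=(), text=""):
    """Return entries carrying every required tag and matching the free text."""
    wanted = {t.lower() for t in required_tags}
    needle = text.lower()
    results = []
    for entry in entries:
        entry_tags = {t.lower() for t in entry.tags}
        if not wanted <= entry_tags:
            continue  # at least one required tag is missing
        haystack = (entry.name + " " + entry.description).lower()
        if needle and needle not in haystack:
            continue  # free text does not match name or description
        results.append(entry)
    return results


# Hypothetical sample entries, for illustration only.
CATALOG = [
    CatalogEntry("SALES.CUSTOMER.EMAIL", "Customer email address", {"GDPR", "confidential"}),
    CatalogEntry("SALES.ORDERS", "Order header table", {"internal"}),
    CatalogEntry("HR.EMPLOYEE.SALARY", "Annual salary", {"GDPR", "restricted"}),
]

gdpr_hits = [e.name for e in search(CATALOG, required_tags=["gdpr"])]
print(gdpr_hits)
```

A real catalog would index the tags and add AI or SQL on top, but the matching rule is essentially this one.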
2. Scope
You need to catalog not just Snowflake databases but also all the files in your data lake and all the transformations in your ELT tool. And yes, your BI/reporting tool too. And you certainly do not want to type it all in! So a good crawler is essential. Programmable of course. And incremental.
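What “incremental” means for a crawler can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the `diff_crawl` function and the two snapshots are hypothetical, and the timestamps stand in for something like the LAST_ALTERED column that Snowflake exposes in INFORMATION_SCHEMA.TABLES. The crawler keeps the state of the previous run and only processes what is new, changed or dropped since then.

```python
from datetime import datetime


def diff_crawl(previous, current):
    """Compare two {object_name: last_altered} snapshots.

    Returns (new, changed, dropped) as sorted lists of object names.
    """
    new = sorted(name for name in current if name not in previous)
    changed = sorted(
        name for name in current
        if name in previous and current[name] > previous[name]
    )
    dropped = sorted(name for name in previous if name not in current)
    return new, changed, dropped


# Hypothetical snapshots: yesterday's crawl state and today's listing.
yesterday = {
    "SALES.ORDERS": datetime(2024, 5, 1, 8, 0),
    "SALES.CUSTOMER": datetime(2024, 5, 1, 8, 0),
    "STAGE.TMP_LOAD": datetime(2024, 4, 30, 23, 0),
}
today = {
    "SALES.ORDERS": datetime(2024, 5, 2, 6, 30),   # altered since the last crawl
    "SALES.CUSTOMER": datetime(2024, 5, 1, 8, 0),  # unchanged, so it is skipped
    "SALES.RETURNS": datetime(2024, 5, 2, 7, 0),   # new table
}

new, changed, dropped = diff_crawl(yesterday, today)
print("new:", new, "changed:", changed, "dropped:", dropped)
```

Only the new and changed objects need to be re-profiled and re-classified, which is what keeps a daily crawl cheap.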
3. Security
You need to specify who can access what data, which steward owns what data. You define the access policy. Column level security such as data masking and external tokenisation. Table level and row level security. Access history, like who was accessing what and when. Which data does not have a steward.
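Two of the questions above (which data does not have a steward, and who was accessing what) are simple lookups once the catalog holds the information. Here is a rough Python sketch; the `STEWARDS` mapping, the `ACCESS_LOG` rows and both function names are hypothetical, and a real tool would read them from its own repository and from the platform's access history.

```python
from collections import defaultdict

# Hypothetical steward assignments; None means nobody owns the asset yet.
STEWARDS = {
    "SALES.ORDERS": "alice",
    "SALES.CUSTOMER": "bob",
    "HR.EMPLOYEE": None,
}

# Hypothetical access-history rows: (user, asset, ISO date).
ACCESS_LOG = [
    ("carol", "SALES.ORDERS", "2024-05-01"),
    ("dave", "HR.EMPLOYEE", "2024-05-01"),
    ("carol", "HR.EMPLOYEE", "2024-05-02"),
]


def unowned_assets(stewards):
    """List the assets that have no steward assigned."""
    return sorted(asset for asset, owner in stewards.items() if owner is None)


def accessed_by(log):
    """Map each asset to the sorted list of users who accessed it."""
    users = defaultdict(set)
    for user, asset, _day in log:
        users[asset].add(user)
    return {asset: sorted(names) for asset, names in users.items()}


gaps = unowned_assets(STEWARDS)
readers = accessed_by(ACCESS_LOG)
print("no steward:", gaps)
print("readers:", readers)
```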
So those are 3 essential features of a modern data catalog. There are many other features and functionality of course, such as approvals (workflow), reporting, data quality, data profiling, data dictionary, data classification, etc. But essentially you want something like a Google indexer. It crawls all your data sources every day, collecting information, classifying it and labelling it accordingly. And on the front end you have a Google search, which can find anything in your catalog, and display it nicely.
Software, any software, can’t be described as modern if it doesn’t have these technologies:
1. Cloud
It needs to be a managed service aka SaaS. You do not install anything. You pay based on your usage and data amount. It needs to “scale with you”. You should be able to move cloud provider so it needs to work with the Big Three (AWS, Azure, GCP). Both the web service and the agent should work in the Big 3, and the database too (all data catalogs have a database and many services). And the Key Store too (KMS).
2. Mobile
It can be accessed from a mobile device, preferably in an app. As well as from desktop of course. It should work from home (WFH) and in office (WFO). People work hybrid these days.
3. SSO
You do not want any user to type in any password. When they go to the data catalog, they do not log in. They go straight in. That is more secure than using a password. It’s called SSO (Single Sign On) for those of you who don’t know it. You manage your users in Entra ID and define them in groups. When the data catalog connects to Snowflake databases or S3 data lakes, you also want to use SSO such as IAM for AWS and Entra ID for Azure.
4. High Availability
Primary & secondary instance, auto sync/replication. Upgrade is automatic (you don’t upgrade to a new version every 3 months). Backup is automatic (you don’t do backup manually every day), restore is simple. You do not want your data catalog tool to be down for hours. Slowly but surely it will become a critical tool in the business. Your data governance depends on it, and therefore your head.
Open source data catalog
Not all organisations have a $300k budget for a data catalog, plus $200k for the salary. If you are looking to pay a part timer (say 0.5 FTE) and use free software (read: open source), head straight down to this Atlan page: link. On that page Atlan describes these 6 data catalogs, all of them open source: Atlas from Apache, Amundsen from Lyft, DataHub from LinkedIn, Marquez from WeWork, Open Data Discovery and OpenMetadata. Let me add one more: Metacat from Netflix.
Note: Marquez is more of a data lineage tool than a data catalog.
Of course, being open source (essentially no cost) you can’t demand the “modern-ness” that I described above (SaaS, SSO, Mobile, HA). Nor the “Rolls Royce” functionalities I listed above (AI search, auto crawler, lakes and BI, masking, row-level, access history). But the above 7 tools get you started in data governance. At least you are in the game, rather than watching from outside the arena. It’s not a Rolls Royce, but it gets you from A to B.
Enterprise Data Catalog
If you have $300k for a data catalog, and another $500k for the FTEs and consultancies, and you expect a modern data platform with Rolls Royce functionalities, bells-and-whistles with all the trimmings, then you want an Enterprise Data Catalog.
With that amount of money (that's per year, not one off) you would have considered all the names in Gartner and Forrester: Alation, Atlan, Ataccama, Collibra, data.world, Informatica and so on (not necessarily in that order, not sure about IBM). You would have called a consultancy too, EDMC and so on.
And yes, you yourself must have spent lots of years in data. You manage teams of data engineers, or a data governance function or a CDO. And you have a task in hand to choose a data catalog, for your Snowflake data platform.
It’s a dilemma, isn’t it? Unlike Databricks, which has Unity Catalog, Snowflake does not have a data catalog. Which, in my opinion, is a blessing. That situation forces you to get a good Enterprise Data Catalog.
Data Catalog for Snowflake
I have given you the criteria in terms of functionalities (Search, Scope, Security). I have given you the technical criteria of a modern tool (SaaS, SSO, Mobile, HA). And you have read the Gartner and Forrester opinions. All you have to do is choose the top 3 and then call them for a demo and an RFP. It’s the usual process, like any other software with any other vendor.
But I know, I know. The reason you read this article is that you want my opinion on which of the above products is the best data catalog application, according to the criteria I mentioned above. Well, before I give my opinion, I must stress that this is my own opinion, not my employer’s, nor my client’s. And it is an opinion, so it is subjective. What I think is good is not necessarily the same as what you think is good. Or other people.
Right, with that caveat, I’ll choose 3 names which I think are strong contenders as a data catalog for Snowflake. Here goes: in my opinion Alation is a strong contender. Not Atlan, but Alation. They have similar names, therefore I need to clarify. In terms of scope, Alation can connect to various databases, file systems, BI tools, and ETL tools (they use OCF, Open Connector Framework) and crawl them. They support SSO, it’s SaaS, it integrates with Teams and Excel. Yes, you can browse/query your data catalog within Excel.
The industry leader is probably Collibra (i.e. the most famous name). They have a wide range of functionalities such as data classification, data dictionary, quality extraction, data profiling and data lineage. They have a lineage harvester (crawler) that works in various databases (Snowflake, Databricks, BigQuery, Redshift, SQL Server, etc.) and in various ETL tools (ADF, dbt, Matillion, Informatica, SSIS, DataStage, etc.). See: link.
Last but not least is IDMC i.e. Informatica Data Management Cloud. Their core product was what used to be known as Axon (the name of a data company that no longer exists). Their data catalog and their data crawler (called EDC) both originated from Axon. IDMC is a complete DG package including data catalog, data integration (PowerCenter), data quality, MDM and data dictionary. They can manage data privacy, data access, data governance and API integration. And they have a data marketplace too. If you are considering an ETL tool (data integration) alongside a data catalog, IDMC is definitely a strong contender.
So that’s it, my top three. And yes, I have used IDMC, Atlan, Alation and Collibra. Only briefly though, I spent a lot more time in databases and ETL than in DG. From a UI point of view, Collibra and Atlan’s look and feel is slightly more modern than IDMC and Alation. But all 4 of them feel modern, not like a 20-year-old app. If I had to choose four I’d probably include Atlan. Once again, this is my own opinion, not the opinion of my employer or my client. Or any other organisation. And I don’t receive any encouragement or reward from any of the companies I mentioned above. And this is an opinion, meaning that it is subjective, not objective. It may well differ from other people’s opinions. Or those of the vendors.
I would welcome YOUR opinion though (both your personal opinion and your company’s opinion). Let me know in the comments below (if you want to say it publicly). Or message me if you want to say it privately.
Keep learning!
List of my articles: https://lnkd.in/eRTNN6GP