Know Thyself, Know Thy Data: Stop Guessing and Start Impressing
Picture this: a conversation with a customer about their data.
“How much data do you have?”
“I don’t know.”
“Approximately?”
“Maybe 500 terabytes.”
“Okay, is that before or after compression?”
“I don’t know.”
“What are your five biggest tables?”
“I don’t know.”
“What’s the biggest table?”
“Not sure.”
“How is the data stored—by columns or rows?”
“I don’t know.”
“Is the table compressed?”
“No idea.”
“How is the data sorted?”
“I don’t know.”
“How is the table most frequently used?”
“I don’t know.”
“What business process is it used in?”
“Not sure.”
“What other datasets do you join it with?”
“I don’t know.”
“Why do you even have the table?”
“No clue.”
“Who created it? Who’s updating it? When was it last reviewed? Does it have stale data?”
“I don’t know, I don’t know, I don’t know, I don’t know.”
This conversation, while exaggerated, highlights a critical issue in many enterprises today: a shocking lack of understanding about their own data. Companies often get caught up in heated debates about which tool to use—whether it’s the latest cloud platform, a shiny new analytics suite, or a cutting-edge AI model. But here’s the key hypothesis: Enterprises spend too much time debating which tool to use and not enough time learning to use the tool effectively or, more importantly, understanding their own data.
What's the Remedy?
The remedy lies in specific individuals spending sufficient time studying their own data and developing a holistic sense of the datasets and the workloads. That individual, or a small team, should then document what they learn for the benefit of the wider team.
Step 1: Map and Document Your Datasets
To gain control over your data, start by creating a comprehensive inventory of all datasets and tables, prioritizing the largest and most critical ones first. There is no need to exhaustively map thousands of datasets when, in most environments, 5 to 25 tables account for the majority of the data by size.
Here’s a structured approach:
List All Datasets and Tables by Size: Begin by cataloging every dataset and table in your environment, sorted by size (in terabytes or gigabytes). Use database metadata or storage analytics tools to extract this information; a minimal sketch of how that can look follows this list.
Document Key Attributes for Each Table: For each one, capture its size before and after compression, whether it is stored by rows or by columns, whether and how it is compressed, how the data is sorted, how the table is most frequently used, which business process it supports, which other datasets it is joined with, who created it and who updates it, when it was last reviewed, and whether it contains stale data.
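As a concrete example, if the environment happens to be PostgreSQL (or anything that exposes similar system catalogs), a short Python sketch can pull the size-ordered inventory straight from the database's own metadata. The psycopg2 driver, the connection string, and the 25-table cutoff below are assumptions for illustration, not requirements.

```python
# Minimal sketch of a table-size inventory, assuming a PostgreSQL warehouse
# and the psycopg2 driver; other engines expose similar metadata through
# INFORMATION_SCHEMA or vendor-specific storage views.
import psycopg2

INVENTORY_SQL = """
SELECT
    schemaname,
    relname                                        AS table_name,
    pg_total_relation_size(relid)                  AS total_bytes,
    pg_size_pretty(pg_total_relation_size(relid))  AS total_size
FROM pg_catalog.pg_statio_user_tables
ORDER BY total_bytes DESC
LIMIT 25  -- the handful of tables that typically hold most of the data
"""

def inventory_tables(dsn: str):
    """Return the largest tables, biggest first."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(INVENTORY_SQL)
        return cur.fetchall()

if __name__ == "__main__":
    # "dbname=warehouse" is a placeholder connection string.
    for schema, table, total_bytes, pretty in inventory_tables("dbname=warehouse"):
        print(f"{schema}.{table:<40} {pretty}")
```

The handful of tables at the top of this list are where the documentation effort should start.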
This mapping exercise transforms “I don’t know” into a clear picture of your data landscape, enabling better decision-making and optimization.
Step 2: Analyze and Optimize Query Performance
Once you’ve mapped your datasets, shift focus to how they’re used. Queries are the bridge between data and business value, but poorly understood or inefficient queries can cripple performance. To address this, analyze the most resource-intensive queries and their impact on your system.
Identify the 10 Most Expensive Queries by Run Time:
Use query performance logs (available in most database management systems) to list the 10 longest-running queries. For each query, record its average run time, how many times it runs per day, and which tables it touches.
Rank Queries by Total Daily Run Time:
Calculate the total daily run time for each query by multiplying its average run time by its daily frequency. For example, Query B might average only two minutes per run but execute 500 times a day, accumulating more than 16 hours of daily run time, while a 30-minute report that runs twice a day totals just one hour. Sort the queries by total daily run time to prioritize optimization efforts, as in the sketch that follows.
Queries with high daily run times (like Query B) are prime candidates for improvement.
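Here is a minimal sketch of that ranking in Python. It assumes the average run times and daily execution counts have already been exported from the engine's query log; the query names and numbers are purely illustrative.

```python
# Minimal sketch of ranking queries by total daily run time. The figures are
# illustrative; in practice they come from the engine's query performance log.
from dataclasses import dataclass

@dataclass
class QueryStats:
    query_id: str
    avg_runtime_s: float  # average run time of a single execution, in seconds
    runs_per_day: int     # how many times the query runs per day

    @property
    def daily_runtime_s(self) -> float:
        # Total daily run time = average run time x daily frequency.
        return self.avg_runtime_s * self.runs_per_day

query_log = [
    QueryStats("Query A", avg_runtime_s=1800, runs_per_day=2),    # 1 hour/day
    QueryStats("Query B", avg_runtime_s=120,  runs_per_day=500),  # ~16.7 hours/day
    QueryStats("Query C", avg_runtime_s=45,   runs_per_day=40),   # 0.5 hours/day
]

# Sort by total daily run time, highest first: these are the optimization targets.
for q in sorted(query_log, key=lambda q: q.daily_runtime_s, reverse=True):
    print(f"{q.query_id}: {q.daily_runtime_s / 3600:.1f} hours/day")
```

The point is the sort key: total daily run time, not the duration of a single execution, determines where optimization effort pays off.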
Step 3: Optimize High-Impact Queries
For each of the top queries, conduct a detailed analysis to determine whether it can be optimized. Key questions to ask: Can the underlying table's storage format, sort order, compression, or partitioning be changed to serve the query better? Can the query be rewritten to scan less data or to reuse results it recomputes on every run? Does it really need to run as often as it does, and does the business still need its output at all?
Optimization requires serious study, often involving collaboration between data engineers, analysts, and business stakeholders to balance performance gains with business requirements.
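A practical starting point for that study is the execution plan of each top query. The sketch below assumes PostgreSQL and the psycopg2 driver, and the candidate query is hypothetical; any engine's EXPLAIN or query-profiling facility serves the same purpose.

```python
# Minimal sketch of pulling the execution plan for one of the top queries,
# assuming PostgreSQL and psycopg2. EXPLAIN (without ANALYZE) only plans the
# query, so this is safe to run against an expensive statement.
import psycopg2

def fetch_plan(dsn: str, query_text: str) -> str:
    """Ask the planner how it would execute the query, without running it."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("EXPLAIN " + query_text)
        # EXPLAIN returns one row of plan text per line of the plan.
        return "\n".join(row[0] for row in cur.fetchall())

if __name__ == "__main__":
    # Hypothetical candidate standing in for an entry from the top-10 list.
    candidate = "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id"
    print(fetch_plan("dbname=warehouse", candidate))
    # Look for sequential scans of the largest tables, oversized sorts, or
    # joins that touch far more data than the business question requires.
```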
By combining disciplined dataset mapping with query performance analysis, enterprises can move beyond tool obsession and focus on what truly matters: using their data effectively to drive business outcomes.