Know Thyself, Know Thy Data: Stop Guessing and Start Impressing

Picture a conversation with a customer about their data. It goes something like this:

“How much data do you have?”

“I don’t know.”

“Approximately?”

“Maybe 500 terabytes.”

“Okay, is that before or after compression?”

“I don’t know.”

“What are your five biggest tables?”

“I don’t know.”

“What’s the biggest table?”

“Not sure.”

“How is the data stored—by columns or rows?”

“I don’t know.”

“Is the table compressed?”

“No idea.”

“How is the data sorted?”

“I don’t know.”

“How is the table most frequently used?”

“I don’t know.”

“What business process is it used in?”

“Not sure.”

“What other datasets do you join it with?”

“I don’t know.”

“Why do you even have the table?”

“No clue.”

“Who created it? Who’s updating it? When was it last reviewed? Does it have stale data?”

“I don’t know, I don’t know, I don’t know, I don’t know.”

This conversation, while exaggerated, highlights a critical issue in many enterprises today: a shocking lack of understanding about their own data. Companies often get caught up in heated debates about which tool to use—whether it’s the latest cloud platform, a shiny new analytics suite, or a cutting-edge AI model. But here’s the key hypothesis: Enterprises spend too much time debating which tool to use and not enough time learning to use the tool effectively or, more importantly, understanding their own data.

What's the Remedy?

The remedy lies in specific individuals spending sufficient time studying their own data and developing a holistic sense of the datasets and the workloads that run against them. This individual, or small team, should then document what they learn for the benefit of the wider team.

Step 1: Map and Document Your Datasets

To gain control over your data, start by creating a comprehensive inventory of all datasets and tables, prioritizing the largest and most critical ones first. There is no need to thoroughly map thousands of datasets when, in most environments, 5 to 25 tables account for the majority of the data by size.

Here’s a structured approach:

List All Datasets and Tables by Size: Begin by cataloging every dataset and table in your environment, sorted by size (in terabytes or gigabytes). Use database metadata or storage analytics tools to extract this information. 
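
As a concrete starting point, here is a minimal sketch of how such an inventory could be pulled from a PostgreSQL-compatible system using the psycopg2 driver; the connection string and the 25-table cutoff are illustrative assumptions, and other platforms expose equivalent metadata views.

```python
# Sketch: list the largest tables in a PostgreSQL database by total on-disk size.
# Assumes PostgreSQL system catalogs and the psycopg2 driver; adapt for other platforms.
import psycopg2

INVENTORY_SQL = """
SELECT n.nspname AS schema_name,
       c.relname AS table_name,
       pg_total_relation_size(c.oid) AS size_bytes
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'                                   -- ordinary tables only
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
ORDER BY size_bytes DESC
LIMIT 25;                                               -- focus on the biggest tables first
"""

def list_largest_tables(dsn: str):
    """Return (schema, table, size in GB) for the largest tables, biggest first."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(INVENTORY_SQL)
        return [(s, t, b / 1024 ** 3) for s, t, b in cur.fetchall()]

if __name__ == "__main__":
    for schema, table, size_gb in list_largest_tables("dbname=analytics"):  # hypothetical DSN
        print(f"{schema}.{table}: {size_gb:,.1f} GB")
```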

Document Key Attributes for Each Table (a sample catalog record follows this list):

  • Size: Record the table’s size before and after compression.
  • Storage Properties: Any platform-specific options governing how the data is stored (e.g., row vs. columnar layout, compression settings, sort or distribution keys).
  • Source of Origin: Identify where the data comes from (e.g., internal application, third-party vendor, IoT devices).
  • Description of Columns and Their Meaning: For each column, document its name, data type, purpose, and any business rules (e.g., “Customer_ID: Unique identifier for customers, 10-digit integer, non-null”).
  • Usage Context: Document the business processes the table supports, how it’s most frequently queried, and which datasets it’s commonly joined with.
  • Ownership and Maintenance: Identify who created the table, who updates it, and when it was last reviewed for stale or redundant data.
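
Captured together, these attributes can live in a simple, version-controlled record per table. The sketch below is one possible shape for such a record; the field names and example values are purely illustrative.

```python
# Sketch: one catalog record per table, covering the attributes listed above.
# Field names and example values are illustrative assumptions, not a prescribed standard.
from dataclasses import dataclass

@dataclass
class TableRecord:
    name: str
    size_tb_raw: float             # size before compression
    size_tb_compressed: float      # size after compression
    storage_properties: str        # platform-specific storage options
    source_of_origin: str          # internal app, third-party vendor, IoT feed, ...
    columns: dict[str, str]        # column name -> meaning and business rules
    usage_context: str             # business processes, frequent queries, common joins
    owner: str                     # who created and maintains the table
    last_reviewed: str             # when it was last checked for stale or redundant data

# A hypothetical entry for an orders table.
orders = TableRecord(
    name="sales.orders",
    size_tb_raw=12.0,
    size_tb_compressed=3.5,
    storage_properties="columnar, partitioned by order_date",
    source_of_origin="internal order-management application",
    columns={"customer_id": "unique identifier for customers, 10-digit integer, non-null"},
    usage_context="daily revenue reporting; commonly joined with sales.customers",
    owner="data engineering team",
    last_reviewed="last quarter",
)
```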

This mapping exercise transforms “I don’t know” into a clear picture of your data landscape, enabling better decision-making and optimization.

Step 2: Analyze and Optimize Query Performance

Once you’ve mapped your datasets, shift focus to how they’re used. Queries are the bridge between data and business value, but poorly understood or inefficient queries can cripple performance. To address this, analyze the most resource-intensive queries and their impact on your system.

Identify the 10 Most Expensive Queries by Run Time:

Use query performance logs (available in most database management systems) to list the 10 longest-running queries. For each query, record the following (a sketch of how to pull this from query statistics follows the list):

  • Run Time: How long does it take to execute?
  • Consistency: Is the run time consistent, or does it vary (e.g., due to data growth or system load)?
  • Frequency: How often is the query run (e.g., daily, hourly, ad hoc)?
  • Datasets Involved: Which tables or datasets does the query access or join?
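
If the platform happens to be PostgreSQL with the pg_stat_statements extension enabled, this list can be pulled straight from the statistics view. The sketch below assumes PostgreSQL 13 or later (where the timing columns are named total_exec_time and mean_exec_time); it is a starting point, not a definitive recipe.

```python
# Sketch: list the 10 longest-running statements from pg_stat_statements.
# Assumes PostgreSQL 13+ with the pg_stat_statements extension enabled.
import psycopg2

TOP_QUERIES_SQL = """
SELECT queryid,
       calls,                                        -- frequency: how often it runs
       mean_exec_time   / 1000.0 AS mean_seconds,    -- average run time
       stddev_exec_time / 1000.0 AS stddev_seconds,  -- rough consistency signal
       total_exec_time  / 1000.0 AS total_seconds,   -- cumulative run time
       left(query, 120)          AS query_snippet
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
"""

def top_queries(dsn: str):
    """Return the 10 statements with the longest average run time."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(TOP_QUERIES_SQL)
        return cur.fetchall()

if __name__ == "__main__":
    for row in top_queries("dbname=analytics"):  # hypothetical DSN
        print(row)
```

The datasets each statement touches still need to be traced by reading the query text and its plan; the statistics view alone does not name them.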

Rank Queries by Total Daily Run Time

Calculate the total daily run time for each query by multiplying its average run time by its daily frequency. For example:

  • Query A: 20 minutes, runs once daily = 20 minutes total daily run time.
  • Query B: 4 minutes, runs every hour (24 times daily) = 96 minutes total daily run time.

Sort the queries by total daily run time to prioritize optimization efforts; queries with high daily run times (like Query B) are prime candidates for improvement, as sketched below.
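
The ranking itself is simple arithmetic; a minimal sketch using the two example queries above:

```python
# Sketch: rank queries by total daily run time = average run time x daily frequency.
queries = [
    {"name": "Query A", "avg_minutes": 20, "runs_per_day": 1},
    {"name": "Query B", "avg_minutes": 4, "runs_per_day": 24},
]

for q in queries:
    q["total_daily_minutes"] = q["avg_minutes"] * q["runs_per_day"]

# Highest total daily run time first: these are the prime optimization candidates.
for q in sorted(queries, key=lambda q: q["total_daily_minutes"], reverse=True):
    print(f"{q['name']}: {q['total_daily_minutes']} minutes/day")
```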

Step 3: Optimize High-Impact Queries

For each of the top queries, conduct a detailed analysis to determine whether it can be optimized. Key questions to ask (a plan-inspection sketch follows this list):

  • Are the joins efficient, or are they scanning unnecessary data?
  • Can indexes be added to speed up frequent operations?
  • Is the query pulling more data than needed (e.g., selecting all columns instead of specific ones)?
  • Can the query leverage partitioning or pre-aggregated data?
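
Answering these questions usually starts with the query plan. On PostgreSQL, for example, an EXPLAIN (ANALYZE, BUFFERS) run shows whether a join scans far more rows than it returns; the sketch below is illustrative, and the query text and connection string are hypothetical placeholders. Note that EXPLAIN ANALYZE actually executes the statement, so run it with care on production systems.

```python
# Sketch: inspect the execution plan of a candidate query on PostgreSQL.
# The connection string and the query text are hypothetical placeholders.
import psycopg2

CANDIDATE_QUERY = """
SELECT o.order_id, c.customer_name
FROM sales.orders o
JOIN sales.customers c ON c.customer_id = o.customer_id
WHERE o.order_date >= DATE '2024-01-01'
"""

def explain(dsn: str, sql: str) -> str:
    """Run EXPLAIN (ANALYZE, BUFFERS) and return the plan as text."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("EXPLAIN (ANALYZE, BUFFERS) " + sql)
        return "\n".join(row[0] for row in cur.fetchall())

if __name__ == "__main__":
    # Look for sequential scans on large tables, misestimated row counts, and spills to disk.
    print(explain("dbname=analytics", CANDIDATE_QUERY))
```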

Optimization requires serious study, often involving collaboration between data engineers, analysts, and business stakeholders to balance performance gains with business requirements.

By combining disciplined dataset mapping with query performance analysis, enterprises can move beyond tool obsession and focus on what truly matters: using their data effectively to drive business outcomes.
