Data Sprawl – What (non-IT) Managers Can Do to Reduce It

In this age of big data, where we have access to more information in a day than the last generation would see in a lifetime, where every start-up is trying to sell you a database or app that is supposedly the ‘next big thing’ (spoiler: it isn’t), and where AI technology promises more data and automated trend analyses than you ever thought you needed (and probably don’t), the risk of Data Sprawl in organizations is only going to grow. All managers should be aware of this and of what they can do to prevent it.

In my security role I work with a lot of data and have encountered numerous issues that have made me aware of the growing problem of Data Sprawl. There are many articles out there about the problem, but these generally all try to sell you something to resolve it. With that in mind, I thought I could start the year by sharing some thoughts and tips about managing Data Sprawl, minus any sales pitches.

What is Data Sprawl?

Simply put, Data Sprawl is what happens when an organization fails to control its data intake (of any kind), focusing on quantity over quality and creating in turn numerous problems that can seriously affect performance, efficiency and security. In this day and age, this can and does occur all too easily, as digital data piles up unseen, taking up no physical space but cluttering your digital space with unsearchable, duplicated, unwanted or just plain bad data. The more cluttered and interconnected the data becomes, the harder the mess is to untangle.

Data security

With great data (and bad data) comes great responsibility. The more data your organization collects and stores, the more you need to keep track of what is being stored, where it is being stored, and how it is being used and shared, while also taking into account requirements for data privacy and data sovereignty in different territories. A dataset of sensitive information that gets forgotten or lost on an online data network (perhaps used for a small project and then never deleted) is not much different from misplacing a physical folder in the office; in both cases the data is not being properly monitored and could be compromised. If you get to the point where you can’t keep track of what data you have, that data is at risk.
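One lightweight way to start, before buying any cataloguing tool, is a simple data register that records the answer to each of those questions for every dataset you hold. Below is a minimal sketch in Python; the field names are my own illustration, not an established standard:

    from dataclasses import dataclass
    from datetime import date

    @dataclass
    class DatasetRecord:
        """One entry in a simple data register (illustrative fields only)."""
        name: str              # what the dataset is
        owner: str             # who is accountable for it
        location: str          # where it lives (server, share, cloud bucket)
        sensitivity: str       # e.g. "public", "internal", "confidential"
        jurisdictions: list    # territories whose privacy/sovereignty rules apply
        shared_with: list      # teams or third parties with access
        review_by: date        # when it must next be reviewed or deleted

    # A forgotten project dataset becomes visible once it has an owner and a review date.
    register = [
        DatasetRecord("incident-logs-2023", "Security team", r"\\fileserver\sec\logs",
                      "confidential", ["EU", "UK"], ["SOC"], date(2025, 6, 30)),
    ]
    overdue = [r for r in register if r.review_by < date.today()]

Even a spreadsheet with these columns beats an unmonitored network share; the point is that every dataset has a named owner and a review date, so nothing can quietly be forgotten.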

Duplication of data and effort

Collating or creating new data can of course be beneficial to an organization, and even exciting to some, but not if you are duplicating work that a colleague already did months ago (and another the year before that) and that is still sitting on your network – that’s just wasting time and resources.

Data quality

One might be tempted to merge several data feeds to create a super-database, but unless this is done methodically, with careful analysis of each data feed, it can create numerous problems downstream that may not be visible to users: duplicated data artificially inflating statistical results or heat maps, inconsistent data fields producing wrong search results or false negatives, and poor-quality input from some sources tainting the whole dataset. Adding bad data to a reliable data stream is a bit like adding raw sewage to the water mains. Yes, you are adding to the water supply, but you are also rendering it useless (unless having more sewage is your goal).
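To make the double-counting concrete, here is a minimal sketch (invented sample data, plain Python) of how merging two overlapping feeds without normalizing and de-duplicating them quietly inflates the statistics downstream:

    # Two hypothetical feeds that partly overlap and label the same field differently.
    feed_a = [{"city": "London", "incidents": 3}, {"city": "Paris", "incidents": 2}]
    feed_b = [{"City": "london", "incidents": 3},  # the same London record, different casing
              {"City": "Berlin", "incidents": 1}]

    # Naive merge: London is counted twice, so every total downstream is inflated.
    naive = feed_a + [{"city": r["City"], "incidents": r["incidents"]} for r in feed_b]
    print(sum(r["incidents"] for r in naive))   # 9, though the true total is 6

    # Methodical merge: normalize the inconsistent field first, then de-duplicate.
    seen, merged = set(), []
    for record in naive:
        key = record["city"].strip().lower()
        if key not in seen:
            seen.add(key)
            merged.append(record)
    print(sum(r["incidents"] for r in merged))  # 6

The same duplicated record that inflates a simple sum will also inflate heat maps and trend lines, and unlike this toy example, real feeds rarely make the overlap this obvious.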

Maintenance

Data feeds, by whatever means they arrive (API, SFTP, email, pony express, etc.), will all have occasional failures and require fixes and maintenance. If you only have one feed, this might happen two or three times a year; but with 20 data feeds that is 40-60 failures a year, roughly one every week, and your team can end up in an endless cycle of fixes that eats up significant time and resources.
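The arithmetic is simple enough to sketch; the 2-3 failures per feed per year is the illustrative rate above, not a benchmark:

    # Back-of-the-envelope: failure frequency scales linearly with feed count.
    failures_per_feed_per_year = 2.5   # midpoint of the 2-3 figure above
    for feeds in (1, 5, 20):
        per_year = feeds * failures_per_feed_per_year
        print(f"{feeds:>2} feeds: ~{per_year:.0f} failures/year, "
              f"one every ~{365 / per_year:.0f} days")
    # 20 feeds: ~50 failures/year, one every ~7 days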

Efficient Data Management and Usage

But we all need more data, don’t we? We have to stay ahead of the curve! We don’t want to be left behind, do we? It depends on what you want to achieve and how you want to achieve it. If you don’t know what your end goal is, starting with the data will not help. Throwing more data at an undefined problem or question is like adding wind to your sails without having set a course. Furthermore, giving your team more data than they can handle only hinders rather than enables them, as they spend more time searching through and assessing the various data sources than actually analyzing the usable data for useful findings.

But more data means making more informed decisions, right? Perhaps, but how much time does your team have to make these decisions, and how many such decisions do they have to make every day? An overabundance of data can lead to Analysis Paralysis, where team members struggle to make effective and timely decisions because they are overthinking or overanalyzing vast amounts of data. Even if your team effectively triages and prioritizes the available data to make a decision, they can still fall foul of management later applying hindsight to the wider pool of data and second-guessing the team’s decisions at every turn (which is never a winning formula).

Costs

If that still does not convince you, then remember that data requires storage, and storage costs money. You’d be surprised how quickly 100TB of storage can fill up when you don’t control what goes in, which is all the more annoying when you discover that 80% of that data is not being used by anyone.
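The waste is easy to put a rough number on. The rate below is purely illustrative (cloud object storage is commonly priced in the low cents per GB per month; your actual rate will differ):

    # Illustrative cost of unused storage (assumed rate, not a quote).
    total_tb = 100
    unused_fraction = 0.80        # the 80% figure above
    price_per_gb_month = 0.02     # assumed ~$0.02 per GB per month for illustration
    monthly_cost = total_tb * 1024 * price_per_gb_month
    wasted = monthly_cost * unused_fraction
    print(f"~${monthly_cost:,.0f}/month total, ~${wasted:,.0f}/month on data nobody uses")
    # ~$2,048/month total, ~$1,638/month wasted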

Reducing Data Sprawl – what managers can do

As I mentioned, there are numerous companies out there offering solutions to help reduce Data Sprawl, and I’m sure that some of them are very good. But before you try to resolve the issue of too much data by purchasing yet another data tool, think first about what you can do to tidy and manage your data through good housekeeping and best practices.

  • Remember that Data Sprawl is not just an IT issue, nor is it an operational issue; it is first and foremost a management issue that is best resolved through strong and efficient leadership.
  • Determine who has oversight of the data. Is it someone in-house, and if so, do they have the necessary authority to approve or veto new data sources? Is it managed by a third party, and if so, have they accepted liability for this data?
  • Take responsibility for the data you use: what is it? Is it any good? If you don’t understand the data you are supplying to your team, how do you expect them to understand it?
  • If the data is no good, get rid of it.
  • Assess your data to determine whether you actually need it. What specific purpose does it serve, and what measurable outcome does it deliver? (A simple staleness scan, like the one sketched after this list, can provide a first answer.)
  • Ensure that any sensitive, confidential or private data is treated accordingly, with proper supervision and accountability.
  • If merging multiple sources of data, determine whether they are actually compatible. It is better to accept that some data sources cannot be merged than to shoehorn together things that don’t match.
  • Keep innovation in check and under proper scrutiny. If an enthusiastic team member or contractor offers to create or provide a new data source, ask them to justify its value and explain how they plan to maintain it going forward (hint: the terms “The team can do it” and “It maintains itself” are big red flags). Don’t fall into the trap of innovation for innovation’s sake.
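On the point about assessing whether you still need your data, a crude but useful first pass is simply to look at when files were last touched. Here is a minimal sketch; the one-year threshold and the UNC path are my own illustrative choices, and note that access times are not reliably updated on every filesystem:

    import os, time

    def stale_files(root, max_age_days=365):
        """Yield (path, days since last access) for files untouched beyond the threshold."""
        cutoff = time.time() - max_age_days * 86400
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    atime = os.stat(path).st_atime
                except OSError:
                    continue  # unreadable file: skip it rather than abort the scan
                if atime < cutoff:
                    yield path, (time.time() - atime) / 86400

    for path, age in stale_files(r"\\fileserver\shared\data"):  # illustrative path
        print(f"{path}: not accessed in ~{age:.0f} days")

A report like this will not tell you whether the data is any good, but it quickly surfaces the forgotten project datasets mentioned earlier: the prime candidates for review or deletion.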

Those are just my thoughts as someone who works with a lot of data day to day. My experience in the fields of security and intelligence will likely differ from that of people working in other industries, but the basic principles generally remain the same: before you start buying fancy tools, the core solution is for managers to take responsibility and maintain better oversight of the data they let into their organization.

Any other practical tips are welcome... no sales pitches, please!
