The Future of Data is Open

A summary of why open data sets are so important for providing context to data for analysis and machine learning. The full post is on my blog.

Generally, the data we retain within organisational databases is very factual. “Tom bought 6 cans of blue paint”. “Mary delivered package #abc to Jean at 2.30pm”. These are typical examples of the types of records you might find in various corporate database systems. They record specific events that the company has decided ahead of time are important to keep, and it has invested in building applications to create and manage this data.

This data is by nature very focused on “what” has happened or is happening, which of course is useful for lots of business operations such as logistics, auditing, reporting and so on. However, over time most organisations turn to using this data to try to understand “why” things happened, so we can influence them to happen more (for good things) or less (for undesirable things). To try to answer this we sometimes create data warehouses that pull data from various sources into a single repository, with a view to creating analysis that focuses on explaining the data rather than just reporting aggregated facts.

But this is where we start having issues. Our proprietary data represents decisions made by customers and/or staff in the context of their "real world" lives, and real world reasoning is highly complex and influenced by many factors. So complex, in fact, that many organisations may struggle to explain these relationships in any logical fashion. Sure, Tom might buy paint, but why did he need paint? Why did he buy blue paint? Why did he buy 6 cans? Why did he purchase at 2.30pm on a Tuesday? Did Tom need anything else we sell at that time other than paint? From our factual data alone, the context needed to determine such things may simply not exist, so we can stare at this data all day long and we are never going to be able to answer such questions with authority.

Ron Dunn

Performance Specialist at Snowflake - The Data Cloud

8y

Great work, Tony. You're right on the money with this statement, "Publishers of open datasets need to understand that quite probably it will never be a human consuming their data directly in the future, instead it will be a computer for which raw data is more useful." Unfortunately, most data sets in Australia fail for this reason.
