Permanent Head Damage in Big Data Technology

Sherlock Holmes once quipped that it is a capital mistake to theorize before one has data. Indeed, data-driven decision making has proven essential for improving organizational performance. This drives organizations to gather data, analyze it, and use decision support models to improve both the speed and the quality of their decisions. With the rapid decline in the cost of both storage and computing power, there are hardly any limits on what you can store or analyze. As a result, organizations have started building data lakes and investing in big data analytics platforms to store and analyze as much data as possible... Are you doing the same thing?

This is especially true in the consumer goods and services sector, where big data technology can be transformative because it enables very granular analysis of human activity (down to the individual level). With these granular insights, companies can personalize their offerings, potentially increasing revenue by selling additional products or services. This allows new business models to emerge and is changing the way business is done entirely. Because the potential of all this data is enormous, many organizations are investing in big data technology expecting plug-and-play inference to support their decision making. Big data practice, however, is something else entirely, and it is full of rude awakenings and headaches.

That big data technology can create value is proven by the fact that companies like Google, Baidu, Facebook, Alibaba, and Amazon exist and do well. Surveys from Gartner and IDC show that the number of tech companies adopting big data technology keeps growing. Many of them want to use this technology to improve their business and start using it in an exploratory manner. When asked about the results of their analyses, many report difficulty in getting results due to data issues; others report problems getting insights that go beyond preaching to the choir. Some even report disappointment because their outcomes turn out to be wrong when put into practice. The lack of experienced analytical talent is often mentioned as a reason for this, but there is more to it. Although big data has the potential to be transformative, it also comes with fundamental challenges which, when not acknowledged, can lead to unrealistic expectations and disappointing results. Some of these challenges are, at this time, not even solvable.

  • Even with a lot of data, most of it cannot be used properly

To explain some of these fundamental challenges, let’s take the example of an online retailer. The retailer has data on its customers and uses it to identify general customer preferences. Based on these preferences, offers are generated and customers are targeted. The retailer wants to increase revenue and starts to collect more data at the individual customer level. It intends to use the additional data to create personalized offerings (the right product, at the right time, for the right customer, at the right price) and to make predictions about future preferences (so it can restructure its product portfolio continuously). To do so, the retailer needs to find out what its customers’ preferences are and what drives their buying behavior. This requires constructing and testing hypotheses based on the customer attributes gathered. In the old situation, the number of available attributes (such as address, gender, past transactions) was small, so only a small number of hypotheses (for example, “women living in a certain part of the city are inclined to buy a specific brand of white wine”) had to be tested to cover all possible combinations. As the number of attributes increases, however, the number of attribute combinations to be investigated grows exponentially. The practical way around this is to significantly reduce the number of attributes taken into account. This leaves much of the data unused and many possible combinations of attributes untested, thereby reducing the potential to improve. It may also explain why many big data analysis results turn out to be rather obvious.
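
To make the combinatorial explosion concrete, here is a minimal Python sketch (the attribute counts are illustrative, not the retailer’s actual figures): treating every subset of binary customer attributes as one candidate hypothesis, n attributes already give 2^n combinations to examine.

```python
# Minimal sketch: growth of the hypothesis space with the number of attributes.
# Every subset of binary attributes defines one candidate hypothesis (a customer
# segment), so n attributes yield 2**n combinations to examine.
# The attribute counts below are illustrative.

def hypothesis_count(n_attributes: int) -> int:
    """Number of attribute subsets (candidate hypotheses) for n attributes."""
    return 2 ** n_attributes

for n in (3, 10, 20, 50):
    print(f"{n} attributes -> {hypothesis_count(n):,} candidate hypotheses")
# 10 attributes already give 1,024 hypotheses; 50 give over a quadrillion.
```

Testing each of these combinations properly becomes infeasible long before the attribute list reaches “big data” proportions, which is why most attributes end up being dropped.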

  • The larger the data set, the stronger the noise

There is a different challenge with analyzing large amounts of data. As the size of the data set increases, all kinds of patterns will be found, but most of them will be just noise. Recent research has shown that as data sets become larger, they must contain arbitrary correlations. These correlations appear due to the size, not the nature, of the data, which means most of the relationships found will be spurious. Without proper practical testing of the findings, this could cause you to act upon a phantom correlation. Testing all the detected patterns in practice is impossible, as the number of detected relationships grows exponentially with the size of the data set. So even though you have more data available, you are worse off, because too much information behaves like very little information. Besides the increase of arbitrary correlations in big data sets, testing the massive number of possible hypotheses is also going to be a problem.
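
A quick simulation illustrates the point. The sketch below (synthetic data; the cutoff for what counts as a “notable” correlation is an arbitrary choice for illustration) generates completely independent variables and counts how many pairs still look correlated purely by chance. The count grows as more variables are added, even though there is no real signal at all.

```python
# Minimal sketch: spurious correlations in purely random data.
# All columns are generated independently, so every "notable" correlation
# found below is noise. The cutoff of 0.08 is arbitrary, chosen only to show
# how the number of chance hits grows with the number of variables.
import numpy as np

rng = np.random.default_rng(seed=42)
n_rows = 1_000
cutoff = 0.08

for n_cols in (20, 100, 500):
    data = rng.normal(size=(n_rows, n_cols))        # independent columns, no real signal
    corr = np.corrcoef(data, rowvar=False)          # pairwise correlation matrix
    pairs = np.abs(corr[np.triu_indices(n_cols, k=1)])
    hits = int(np.sum(pairs > cutoff))
    print(f"{n_cols:>3} unrelated variables: {pairs.size:>6} pairs, "
          f"{hits} exceed |r| > {cutoff} (all spurious)")
```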

This implies that we will find an increasing number of statistically significant results due to chance alone. As a result, the number of false positives will rise, potentially causing you to act upon phantom findings. Note that this is not only a big data issue but a small data issue as well: in the retailer example above, we already need to test 1,024 hypotheses with only ten attributes.
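
The arithmetic of multiple testing makes the scale of the problem clear. A minimal sketch (illustrative numbers, assuming every null hypothesis is actually true) of the expected number of chance “discoveries” at a conventional significance level, together with the Bonferroni-adjusted threshold needed to keep the family-wise error rate in check:

```python
# Minimal sketch: expected false positives under multiple testing.
# Assumes every null hypothesis is true, so any "significant" result is a
# chance finding. The attribute counts are illustrative.
alpha = 0.05

for n_attributes in (10, 20, 30):
    n_hypotheses = 2 ** n_attributes                 # all attribute subsets
    expected_chance_hits = alpha * n_hypotheses      # false positives at level alpha
    bonferroni = alpha / n_hypotheses                # adjusted per-test threshold
    print(f"{n_attributes} attributes: {n_hypotheses:,} hypotheses, "
          f"~{expected_chance_hits:,.0f} chance 'discoveries' at alpha={alpha}, "
          f"Bonferroni threshold {bonferroni:.1e}")
```

With ten attributes you can already expect around fifty “significant” findings by chance alone, and a correction strict enough to suppress them also makes real effects much harder to detect, which is one more reason to shrink the hypothesis space before testing.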

  • Data-driven decision making has nothing to do with the size of your data 

So, should the above challenges stop you from adopting data-driven decision making? No, but be aware that it requires more than just some hardware and a lot of data. Sure, with a lot of data and enough computing power, significant patterns will be detected even if you cannot examine all the trends that are in the data. However, few of these patterns will be of any interest, as spurious patterns will vastly outnumber the meaningful ones. Therefore, as the size of the available data increases, the skill level needed to analyze it has to grow as well. In my opinion, data and technology (even a lot of them) are no substitute for brains. The smart way to deal with big data is to extract and analyze the essential information embedded in “mountains of data” and to ignore most of the rest. You could say that you first need to trim down the haystack to better locate where the needle is. What remains are collections of small amounts of data that can be analyzed much more thoroughly. This approach will spare you a big headache from your big data initiatives and will improve both the speed and the quality of data-driven decision making within your organization.
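
As one possible way to “trim down the haystack”, the sketch below (synthetic data; the univariate correlation score and the shortlist of five attributes are assumptions chosen for illustration, not the only reasonable choices) ranks a large pool of candidate attributes by how strongly each relates to the outcome and keeps only a handful for closer, hypothesis-driven analysis.

```python
# Minimal sketch: trim the haystack before looking for the needle.
# Synthetic data: 300 candidate attributes, of which only two actually drive
# the outcome. A simple univariate score (absolute correlation with the
# outcome) is used to build a small shortlist for proper analysis; the score
# and the shortlist size are illustrative choices.
import numpy as np

rng = np.random.default_rng(seed=7)
n_rows, n_cols = 5_000, 300
X = rng.normal(size=(n_rows, n_cols))                          # mostly irrelevant attributes
y = 0.8 * X[:, 3] - 0.5 * X[:, 42] + rng.normal(size=n_rows)   # only columns 3 and 42 matter

scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_cols)])

top_k = 5
shortlist = np.argsort(scores)[::-1][:top_k]
print("Attributes shortlisted for detailed analysis:", sorted(shortlist.tolist()))
# The shortlist is now a small-data problem: it can be modeled, validated,
# and tested in practice, instead of mining all 300 attributes at once.
```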

Enjoy the journey!

Hadi I

Data Engineer (Python, Pyspark, SQL) and Analyst

Big data is just a distributed computing platform for heavy-load processing. Analytics can run on it, but not always. I agree with you: the common misconception nowadays is to link big data and analytics automatically, although that is not always the case.
