Big Data Reduction 2: Understanding Predictive Analytics

Last time we described the simplest class of analytics (i.e. descriptive analytics) that you can use to reduce your big data into much smaller, consumable bites of information. Remember, most raw data, especially big data, are not suitable for human consumption, but the information we derive from the data is.

Today we will talk about the second class of analytics for data reduction—predictive analytics. First, let me clarify 2 subtle points about predictive analytics that are often confused.

  1. The purpose of predictive analytics is NOT to tell you what will happen in the future. It cannot do that. In fact, no analytics can do that. Predictive analytics can only forecast what might happen in the future, because all predictive analytics are probabilistic in nature.
  2. Predictive analytics are not limited to the time domain. Some of the most interesting and useful predictive analytics in big data (e.g. social media) are non-temporal in nature. We will look at some examples later in this article.

Predictive Analytics: Extrapolate (e.g. Forecast)

The easiest way to understand predictive analytics is to apply it to the time domain. In this case, the simplest and most familiar predictive analytic is a trend line, which is typically a time series model (or some other temporal predictive model) that summarizes the past trajectory of the data. Although temporal predictive models can be used to summarize existing data, the power of having a model is that we can use it to extrapolate to a future time where data doesn't exist yet. This extrapolation in the time domain is what scientists refer to as forecasting.
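To make this concrete, here is a minimal sketch in Python (with made-up numbers, not real data) of fitting a trend line to past observations and extrapolating it beyond the last data point:

```python
import numpy as np

months = np.arange(12)  # data we have: months 0..11
rng = np.random.default_rng(0)
sales = 100 + 5 * months + rng.normal(0, 3, 12)  # hypothetical noisy sales

# Fit a degree-1 polynomial (a trend line) to the observed data.
slope, intercept = np.polyfit(months, sales, deg=1)

# Extrapolate to months 12..14, where no data exists yet.
future_months = np.arange(12, 15)
forecast = slope * future_months + intercept
print(forecast)
```

A real forecast would also report uncertainty (e.g. prediction intervals), since predictive analytics is probabilistic in nature.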

Although predicting the future is a common and easy-to-understand use case of predictive analytics, statistical models are not limited to predictions in the temporal dimension. They can predict practically anything, as long as their predictive power is properly validated (see Learning the Science of Prediction). The essence of predictive analytics, in general, is that we use existing data to build a model. Then we use the model to predict data that doesn't yet exist. So predictive analytics is all about using the data you have to predict (or "estimate" for non-temporal predictions) the data that you don't have.

In the case of temporal prediction, the data we don't have are inaccessible because it is impossible to obtain future data. No one has data about the future, because no one can measure the future; we would need a time machine to do that. But there are many other reasons why we may not have the data we need and must therefore use a model to estimate or infer it. In practice, the reason is almost always that the data is too expensive to measure, either at the accuracy or at the scale we need.

Examples of Non-Temporal Predictive Analytics

If you’ve been following my writing on influence, you’ve already seen an example of non-temporal predictive analytics, where a model uses someone’s existing social media activity data (data we have) to estimate his/her potential to influence (data we don’t have). In this case, influence data is just really hard to measure and track at the scale of the social web (see Why Brands STILL don't Understand Digital Influence?). So we use something that we can measure at scale (i.e. people’s social media interactivity data) to estimate it.
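As a hedged illustration (the features, labels, and linear model here are hypothetical stand-ins, not the actual influence model), estimating influence from activity data might look something like this:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: each row is a user; columns are activity
# features such as posts per week, replies received, and reshares.
activity = np.array([[10, 40, 5],
                     [3, 12, 1],
                     [25, 90, 20],
                     [7, 30, 4]])
# Hypothetical influence scores obtained from some independent study.
influence = np.array([62.0, 18.0, 95.0, 45.0])

model = LinearRegression().fit(activity, influence)

# Estimate influence for a new user whose influence we never measured.
new_user = np.array([[15, 55, 9]])
print(model.predict(new_user))
```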

Another well-known example of non-temporal predictive analytics in social analytics is sentiment analysis. Like influence, no platform out there actually measures people’s sentiment. To truly measure someone’s sentiment, the platform would have to survey his sentiment toward every entity (i.e. person, place, or object) mentioned in his posted message. Even if such a platform existed, no user would bother to provide answers when explicitly prompted for sentiment data on a list of mentioned entities.

This means nobody has actual sentiment data, not at the scale of the social web. However, we can track and store the textual content of any user posting (e.g. tweets, updates, blog articles, forum messages, etc.). So we build a model that estimates the user’s sentiment (data we don’t have) from his postings (data we have).
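Here is a minimal sketch of that idea, using a tiny made-up corpus and a generic text classifier (not any particular platform's sentiment model):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus; real sentiment models train on far more data.
posts = ["love this product", "terrible support experience",
         "great service, very happy", "worst purchase ever"]
labels = ["positive", "negative", "positive", "negative"]

# Bag-of-words features feeding a logistic regression classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(posts, labels)

# Estimate the sentiment of a posting we have no sentiment label for.
print(model.predict(["really happy with the support"]))
```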

How do you do this? Well, there are many ways, but it really doesn’t matter. As long as the model is able to estimate the user’s sentiment accurately, who cares how you come up with the model? But how can you be sure of the model’s prediction accuracy? You must validate the model with an independent measure of sentiment (see Learning the Science of Prediction).

With predictive analytics, coming up with a predictive model is the easy part, because anyone can hypothesize a theory about how things work and build a model based on that theory. The hard part is validating it (i.e. making sure that it can actually predict accurately).
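A simple way to do this validation is to hold out data the model never saw during fitting and score its predictions against that independent measure. A sketch, with placeholder data standing in for real features and labels:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))  # placeholder features (data we have)
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, 100)

# Hold out 25% of the data; the model never sees it during fitting.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# The held-out error is the honest estimate of predictive accuracy.
print(mean_absolute_error(y_test, model.predict(X_test)))
```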

Conclusion

While the purpose of descriptive analytics is to summarize and tell you what has happened in the past, the purpose of predictive analytics is to tell you what might happen in the future. Although predictive models can certainly summarize the existing data through their model parameters, the real advantage of having a model is that we can extrapolate the model to regimes where data doesn’t exist and make predictions. So general predictive analytics is all about using data that we have to estimate or infer data that we don’t have, for one of two reasons:

  1. The data is inaccessible: it is impossible to obtain (e.g. it lies in the future)
  2. The data is too expensive to measure

So you have seen 2 classes of analytics for data reduction: descriptive and predictive. Next time let’s take a look at what prescriptive analytics is all about. More importantly, what does it take to have prescriptive analytics? You don't want to miss the next post...

In the meantime, do you work with predictive analytics? What is the data that you have (i.e. the accessible data), and what is the data that you don’t have and are trying to estimate (i.e. the inaccessible data)? Do you validate your predictive models? Let’s open the floor to discussion. That’s how we all learn...
