Dirty Secret of SAS, SPSS and Alteryx

"I am not a Data Scientist and I just built
a predictive model!"

The Promise

Chances are that if you are reading this post, you have seen at least one sales pitch or conference presentation from one of the mainstream (or less mainstream) Predictive Modelling software vendors.

The pitch typically follows the same enticing pattern: load the table, click a few buttons, and here is your Model. It’s that simple. It’s meant to lure you into thinking: “If you buy our software, you can get your own models in just a few clicks.”

I have witnessed presenters saying: “Look, I am not a Data Scientist and I just built a churn model.” One might wonder why companies complain about a shortage of Data Science skills when so many sales reps can build Predictive Models in minutes…

It is a well-oiled sales machine, polished and perfected over the last 15 years. Eventually one of the examined tools wins the procurement process and licences are bought.

What happens next?

More often than not - nothing.

 

The Reality

The models that were supposed to rain down upon us just don’t materialise.
I have watched two telecoms buy SAS/SPSS software that ended up lying idle for the first nine months. A utilities provider bought a number of SPSS seats, and within 12 months they had produced only one model. A bank bought a comprehensive SAS suite and invested in an extremely powerful server; after one model delivered by external consultants, not a single model was built for the following three years. I could go on. These cases are not unusual – they are the norm.

So a good question to pose would be: why do companies burn hundreds of thousands of dollars on tools they don't use?
The answer: because during the sales pitch they attended, someone failed to explain how to obtain that little table used in the demo – let’s call it a Modelling Table.

“Typically it takes months of manual coding to transform the source data into
a structured Modelling Table”

You see – Modelling Tables don’t just sit in your database.

As I wrote here, building Modelling Tables is NOT part of most Data Science curricula, so not many analysts truly know how to do it. Sometimes the key stakeholders in the process aren’t even aware that it is needed at all.

The truth is - typically the data in your database is nowhere near the format required by Predictive Modelling algorithms.

If you just load your tables from the database to the Predictive Modelling tool, it simply won’t work:

All Predictive Analytics algorithms require a rigidly structured table of aggregates, otherwise known as Features, and the process of converting the source data into the Modelling Table is called Feature Engineering.

And here is where it gets a bit techie: the Features describe customers’ behaviour prior to certain events, usually timed differently for each customer. This is NOT how the data is stored in databases.
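To make this concrete, here is a minimal sketch of the idea in pandas. Everything in it is an illustrative assumption – the table names, the columns, the 90-day window – rather than a recipe, but it shows why event-level data has to be reshaped before any algorithm can touch it:

```python
import pandas as pd

# Raw, event-level data as it typically sits in the database:
# one row per transaction, NOT one row per customer.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "txn_date": pd.to_datetime([
        "2024-01-05", "2024-02-20", "2024-01-12",
        "2024-03-01", "2024-03-15", "2024-02-07",
    ]),
    "amount": [120.0, 80.0, 35.0, 60.0, 45.0, 250.0],
})

# Each customer has their own anchor event (e.g. a churn or offer date),
# so the observation window is timed differently per customer.
anchors = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "anchor_date": pd.to_datetime(["2024-03-01", "2024-03-20", "2024-03-10"]),
})

# Keep only behaviour from the 90 days BEFORE each customer's anchor.
df = transactions.merge(anchors, on="customer_id")
window = df[
    (df["txn_date"] < df["anchor_date"])
    & (df["txn_date"] >= df["anchor_date"] - pd.Timedelta(days=90))
]

# Aggregate events into Features: one row per customer.
modelling_table = window.groupby("customer_id")["amount"].agg(
    txn_count="count", txn_total="sum", txn_avg="mean"
).reset_index()

print(modelling_table)
```

A real project repeats this across dozens of source tables and hundreds of candidate Features, which is where the weeks and months go.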

Typically, it takes weeks or even months of painful manual work to transform the source data into a structured Modelling Table. However, once you have it (and this is where the sales pitch starts), it is rather straightforward to build a predictive model. The pitch is not quite a lie, but it is a long way from honest if someone wants to sell you a Predictive Analytics tool.
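And just how straightforward is that last step? Assuming the illustrative modelling_table from the sketch above, plus a churn label, the entire “few clicks” of the demo amounts to a handful of lines. scikit-learn here is simply a stand-in for whichever vendor tool is being pitched, and the labels are made up:

```python
from sklearn.ensemble import RandomForestClassifier

# Assumes `modelling_table` from the previous sketch; the labels are
# illustrative stand-ins for a real churn flag derived from the anchor event.
X = modelling_table[["txn_count", "txn_total", "txn_avg"]]
y = [0, 1, 0]  # one label per customer

model = RandomForestClassifier(random_state=0).fit(X, y)
print(model.predict(X))
```

Minutes of work, once the months of Feature Engineering are already behind you.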

Here is what the process looks like on a real-life Predictive Analytics project: months of data preparation first, and the quick modelling step only at the very end.

So, back to the sales pitch. A more honest one would go something like this:

“We will start with loading the Modelling Table.
There is something you need to know: to get that table, we had a team of Data Scientists plus ETL developers, and it took us four months to get it right. If I talked more about this, it would ruin the pitch completely, so let’s just watch the last five minutes of this multi-month project.
Here it is – I’ll do a few clicks and we get the Model, ta-da!”

Skipping this middle part is what makes such a pitch deceptive, misleading and harmful to clients.

  

Dirty Little Secret

"They know about it
but you are not supposed to"

The enormity of the effort burnt in the pre-modelling stage is what Tom Davenport called the “dirty little secret” of the industry.

One of the main reasons it remains a "secret" is the sales pitch that consistently avoids the topic. They know about it, but you are not supposed to, as it would effectively kill or, at best, postpone the sale.

Not to blame only one side of the table – there are other reasons too, e.g. Data Scientists being truly embarrassed by how much time they spend on manual data prep. If you ask us what we do, the answer is “Machine Learning, Predictive Modelling and other AI stuff”, while in fact for the last six months we have been trying to join the damn tables and create aggregates in SQL.

So much for the sexiest job of the 21st century. 

Getting ready for the next sales pitch

The world of data analytics is changing – and not necessarily for the better from a Data Scientist’s point of view.

There are dozens of modelling tools available, including free ones, and these, coupled with today's computing power, let us build Predictive Models remarkably fast. Running Machine Learning algorithms has never been easier.

However, with the explosion of new data sources, it is more and more challenging to keep pace when aggregating the source data into a format digestible by Predictive Modelling algorithms. Getting the data ready for modelling, it would seem, has never been harder.
And that, not the modelling, is the real challenge for Data Scientists.

So next time you are watching a Predictive Modelling sales pitch, consider grilling the M.C. a little on how exactly they obtained their Modelling Table and how you could get one within your organisation.
Because once you have it… you really can be just a few clicks away from the model.

Cheers

Maciek

Comments

Michał Krawczyk, Principal Technical Architect at SAS:

In general, preparation (ETL) of analytical tables is a typical part of these projects, especially if the provider has specialised tools to prepare them quickly and, importantly, to maintain them in the following years. This matters because these tables often contain thousands of columns. Even if you prepare the first version in typical ETL tools without dedicated functionalities, expanding, maintaining and changing it will drive you crazy. Agile/Scrum rules! The widest analytical table I ever saw had 17k columns ;)

Victor L., Presales | Data Science | AI:

In my humble opinion, you still require great tools to help you prep the data in a shorter timeframe for modelling to take place. In most cases, 80% of the project time is allocated to understanding and transforming the data into the right format for modelling. Without spending enough time on understanding and transforming your dataset, what you ultimately get is GIGO (Garbage In, Garbage Out).

Rogier Werschkull, Head of Data @ Coolgradient:

Great piece that clearly explains why the 'non-sexy data work', ETL (specifically the T!) and data warehousing (which should focus on building subject-oriented and integrated repositories of information), is NOT dead!
