Dirty Secret of SAS, SPSS and Alteryx

"I am not a Data Scientist and I just built
a predictive model!"

The Promise

Chances are that if you are reading this post, you have seen at least one sales pitch or conference presentation from one of the mainstream (or less mainstream) Predictive Modelling software vendors.

The pitch typically follows the same enticing pattern: load the table, click a few buttons, and here is your Model. It’s that simple. It’s meant to lure you into thinking: “If you buy our software, you can get your own models in just a few clicks.”

I have witnessed presenters saying: “Look, I am not a Data Scientist and I just built a churn model.” One might wonder why companies complain about a shortage of Data Science skills when so many sales reps can build Predictive Models in minutes…

It is a well-oiled sales machine, polished and perfected over the last 15 years. Eventually one of the examined tools wins the procurement process and licences are bought.

What happens next?

More often than not - nothing.

 

The Reality

The models that were supposed to rain down upon us just don’t materialise.
I have watched two telecoms buy SAS/SPSS software that ended up lying idle for the first nine months. A utilities provider bought a number of SPSS seats, and within 12 months they had produced only one model. A bank bought a comprehensive SAS suite and invested in an extremely powerful server; after one model delivered by external consultants, not a single model was built for the following three years. I could go on. These cases are not unusual – they are the norm.

So a good question to pose would be: why do companies burn hundreds of thousands of dollars on tools they don't use?
The answer: because during the sales pitch they attended, someone failed to explain how to obtain that little table used in the demo – let’s call it a Modelling Table.

“Typically it takes months of manual coding to transform the source data into
a structured Modelling Table”

You see – Modelling Tables don’t just sit in your database.

As I wrote here, building Modelling Tables is NOT part of most Data Science curricula, so not many analysts truly know how to do it. Sometimes the key stakeholders in the process aren’t even aware that it is needed at all.

The truth is - typically the data in your database is nowhere near the format required by Predictive Modelling algorithms.

If you just load your tables from the database to the Predictive Modelling tool, it simply won’t work:

All Predictive Analytics algorithms require a rigidly structured table of aggregates, otherwise known as Features, and the process of converting the source data into the Modelling Table is called Feature Engineering.

And here is where it gets a bit techie: the Features describe customers’ behaviour prior to certain events, usually timed differently for each customer. This is NOT how the data is stored in databases.
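To make this concrete, here is a minimal sketch of the idea in pandas. Everything in it is an illustrative assumption – the table names, the columns, the 90-day window – rather than a recipe, but it shows why event-level data has to be reshaped before any algorithm can touch it:

```python
import pandas as pd

# Raw, event-level data as it typically sits in the database:
# one row per transaction, NOT one row per customer.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "txn_date": pd.to_datetime([
        "2024-01-05", "2024-02-20", "2024-01-12",
        "2024-03-01", "2024-03-15", "2024-02-07",
    ]),
    "amount": [120.0, 80.0, 35.0, 60.0, 45.0, 250.0],
})

# Each customer has their own anchor event (e.g. a churn or offer date),
# so the observation window is timed differently per customer.
anchors = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "anchor_date": pd.to_datetime(["2024-03-01", "2024-03-20", "2024-03-10"]),
})

# Keep only behaviour from the 90 days BEFORE each customer's anchor.
df = transactions.merge(anchors, on="customer_id")
window = df[
    (df["txn_date"] < df["anchor_date"])
    & (df["txn_date"] >= df["anchor_date"] - pd.Timedelta(days=90))
]

# Aggregate events into Features: one row per customer.
modelling_table = window.groupby("customer_id")["amount"].agg(
    txn_count="count", txn_total="sum", txn_avg="mean"
).reset_index()

print(modelling_table)
```

A real project repeats this across dozens of source tables and hundreds of candidate Features, which is where the weeks and months go.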

Typically, it takes weeks or even months of painful manual work to transform the source data into a structured Modelling Table. However, once you have it (and this is where the sales pitch starts), it is rather straightforward to build a predictive model. The pitch is not quite a lie, but it is a long way from honest if someone wants to sell you a Predictive Analytics tool.
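And just how straightforward is that last step? Assuming the illustrative modelling_table from the sketch above, plus a churn label, the entire “few clicks” of the demo amounts to a handful of lines. scikit-learn here is simply a stand-in for whichever vendor tool is being pitched, and the labels are made up:

```python
from sklearn.ensemble import RandomForestClassifier

# Assumes `modelling_table` from the previous sketch; the labels are
# illustrative stand-ins for a real churn flag derived from the anchor event.
X = modelling_table[["txn_count", "txn_total", "txn_avg"]]
y = [0, 1, 0]  # one label per customer

model = RandomForestClassifier(random_state=0).fit(X, y)
print(model.predict(X))
```

Minutes of work, once the months of Feature Engineering are already behind you.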

Here is what the process looks like on a real-life Predictive Analytics project: months of data preparation first, and the quick modelling step only at the very end.

So, back to the sales pitch. A more honest one would go something like this:

“We will start with loading the Modelling Table.
There is something you need to know: to get that table, we had a team of Data Scientists plus ETL developers, and it took us four months to get it right. If I talked more about this, it would ruin the pitch completely, so let’s just watch the last five minutes of this multi-month project.
Here it is – I’ll do a few clicks and we get the Model, ta-da!”

Skipping this middle part is what makes such a pitch deceptive, misleading and harmful to clients.

  

Dirty Little Secret

"They know about it
but you are not supposed to"

The enormity of the effort burnt in the pre-modelling stage is what Tom Davenport called the “dirty little secret” of the industry.

One of the main reasons it remains a "secret" is the sales pitch that consistently avoids the topic. They know about it, but you are not supposed to, as it would effectively kill or, at best, postpone the sale.

Not to blame only one side of the table – there are other reasons too, e.g. Data Scientists being truly embarrassed by how much time they spend on manual data prep. If you ask us what we do, the answer is “Machine Learning, Predictive Modelling and other AI stuff”, while in fact for the last six months we have been trying to join the damn tables and create aggregates in SQL.

So much for the sexiest job of the 21st century. 

Getting ready for the next sales pitch

The world of data analytics is changing – and not necessarily for the better from a Data Scientist’s point of view.

There are dozens of modelling tools available, including free ones, and these, coupled with today's computing power, let us build Predictive Models remarkably fast. Running Machine Learning algorithms has never been easier.

However, with the explosion of new data sources, it is more and more challenging to keep pace when aggregating the source data into a format digestible by Predictive Modelling algorithms. Getting the data ready for modelling, it would seem, has never been harder.
And that, not the modelling, is the real challenge for Data Scientists.

So next time you are watching a Predictive Modelling sales pitch, consider grilling the M.C. a little on how exactly they obtained their Modelling Table and how you could get one within your organisation.
Because once you have it… you really can be just a few clicks away from the model.

Cheers

Maciek

Comments

Michał Krawczyk, Principal Technical Architect at SAS:

In general, preparation (ETL) of analytical tables is a typical part of these projects, especially if the provider has specialised tools to prepare them quickly and, importantly, to maintain them in the following years. This matters because these tables often contain thousands of columns. Even if you prepare the first version in typical ETL tools without dedicated functionalities, expanding, maintaining and changing it will drive you crazy. Agile/Scrum rules! The widest analytical table I ever saw had 17k columns ;)

Victor L., Presales | Data Science | AI:

In my humble opinion, you still require great tools to help you prep the data in a shorter timeframe for modelling to take place. In most cases, 80% of the project time is allocated to understanding and transforming the data into the right format for modelling. Without spending enough time on understanding and transforming your dataset, what you ultimately get is GIGO (Garbage In, Garbage Out).

Rogier Werschkull, Head of Data @ Coolgradient:

Great piece that clearly explains why the 'non-sexy data work', ETL (specifically the T!) and data warehousing (which should focus on building subject-oriented and integrated repositories of information), is NOT dead!
