Hadoop vs The Trough

I was at an analytics conference recently where the keynote speaker talked about how Hadoop is entering the Gartner Trough of Disillusionment.  This is the part of the Gartner Hype Cycle that tracks new technology from its initial rapid rise at the Innovation Trigger, to the Peak of Inflated Expectations, then an equally rapid descent into the Trough of Disillusionment.  From there, it’s a steady climb up the Slope of Enlightenment to the Plateau of Productivity.  It is a nice, simple lifecycle that actually tracks new technology surprisingly well.

Hearing about Hadoop’s entry into the trough brought a smile to my face and a little cheer within.  This is not, I hasten to add, because I don’t like Hadoop.  Quite the opposite.  I produced the internal white paper for our company explaining what Hadoop is and why we need it, and from there created the conceptual architecture that integrates Hadoop into our existing data warehouse estate.  Put simply, I get it.  The reason for my internal cheer was a hope that we, as data professionals, will start to apply good data architecture practice to the field.  Let’s look at an example.

Schema-on-Write vs. Schema-on-Read

Several of the use cases I have put forward for Hadoop rest on the schema-on-write vs. schema-on-read argument.  Schema-on-write is what we do in traditional data warehouse environments.  It entails understanding our data sources, transforming them into a consistent format, both syntactically and semantically, and then loading them into a data schema.  This schema is defined before any data is loaded, and is supported by a data model.
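To make that concrete, here is a minimal schema-on-write sketch using PySpark.  The feed, column names and table name are hypothetical illustrations, not anything from a real estate:

```python
# A minimal schema-on-write sketch, assuming PySpark is available.
# The file path, column names and table name are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, DateType

spark = SparkSession.builder.appName("schema-on-write").getOrCreate()

# The schema is agreed and defined before any data is loaded.
sales_schema = StructType([
    StructField("sale_id", StringType(), nullable=False),
    StructField("sale_date", DateType(), nullable=False),
    StructField("amount", DoubleType(), nullable=False),
])

# FAILFAST rejects the load if rows do not conform to the declared schema.
sales = spark.read.csv("incoming/sales_feed.csv", schema=sales_schema,
                       header=True, mode="FAILFAST")
sales.write.mode("append").saveAsTable("warehouse.sales")
```

Because the schema is declared up front, non-conforming data is rejected at load time, which is exactly the discipline (and the cost) of schema-on-write.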

Schema-on-read says “let’s get hold of the data and just load it; from there we can apply data retrieval methods and see what’s in there”.  I have massively simplified the method here, but you get the idea.  No requirement to transform the data before loading.  No requirement to have a fixed data schema or model before loading.  Schema-on-read can be a powerful and useful tool, or quite a reckless one, depending on how it is used.
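The equivalent schema-on-read sketch, in the same hypothetical PySpark setting (the landing path and field names are assumptions):

```python
# A minimal schema-on-read sketch; the landing path and fields are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# Load the raw files exactly as they arrived; no schema is declared up front.
raw = spark.read.json("landing/clickstream/*.json")

# The structure is discovered at read time rather than enforced at load time.
raw.printSchema()
raw.createOrReplaceTempView("clickstream")
spark.sql("SELECT page, COUNT(*) AS hits FROM clickstream GROUP BY page").show()
```

Nothing about the structure of the data was agreed before the load; every question is answered at read time.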

The value of Schema-on-Read

There are times when business cases struggle to articulate the value of putting data into corporate data warehouses.  Requirements may not be fully understood, and the cost of integrating the data can seem high compared to the value understood at that point in time, to give just two examples.  Schema-on-read can be used to capture data into relatively low-cost storage with minimal effort.  From there, it can be explored iteratively with the aim of looking for value.  For me, this brings us back to the spirit of what people like Bill Inmon originally intended: low-cost iterative development in a safe environment, done until we find data products of value.  This, if done correctly, is what schema-on-read can give us, and I love it.
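As a sketch of how little effort that capture step needs, assuming an HDFS-style landing zone (the paths are illustrative):

```python
# Hypothetical capture step: land the raw feed as-is, with no transformation,
# into low-cost distributed storage. Paths are illustrative.
import subprocess

subprocess.run(
    ["hdfs", "dfs", "-put", "/incoming/sensor_feed", "/landing/sensor_feed"],
    check=True,
)
```

From this point on, each exploratory iteration costs analyst time rather than up-front integration effort.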

But we’re only halfway there

So, back to my point about good data architecture practice.  There is a danger that once we have found value in the data, and are even producing semi-live reports and analysis, we consider our job done.  But doing this too many times will result in the data spaghetti that I often see in organisations.  So the next step is back to the business case.

If we have found value in the data, and that value can be sustained over time, then we should complete the business case for this data to be moved into the corporate data warehouse (by the way, when I talk about the corporate data warehouse I make no assumption about the technology, be that relational, NoSQL, etc.; it’s just the trusted source of data for the organisation).  But this time we are going back with a proven value and a complete understanding of the new data source and business requirement.  This is where good data architecture comes into play: understanding our business requirement, the data needed to satisfy it, and the methods to align the delivery of the data to the requirement in a secure and repeatable manner.
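A hedged sketch of that promotion step, reusing the hypothetical clickstream data from the earlier examples: once the requirement and the schema are understood, the load becomes a conventional, repeatable schema-on-write job.

```python
# Hypothetical promotion job: conform the explored raw data to the agreed
# schema and load it into the trusted warehouse table. Names are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder.appName("promote-to-warehouse").getOrCreate()

raw = spark.read.json("landing/clickstream/*.json")

# Explicit names, types and null handling, agreed as part of the business case.
conformed = (
    raw.select(
        col("user_id").cast("string").alias("user_id"),
        to_date(col("event_time")).alias("event_date"),
        col("page").cast("string").alias("page"),
    )
    .where(col("user_id").isNotNull())
)

conformed.write.mode("append").saveAsTable("warehouse.clickstream")
```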

So don’t be too disillusioned by the trough.  This is good news for Big Data, Hadoop and the data professionals who work in this area.  By applying good data architecture practice to the field of Big Data, we will see sustainable business value being driven from that data.  I can already feel myself climbing the Slope of Enlightenment.

So Hadoop is like a playground where we hope to get a return on investment? Are there any studies that detail the probabilities of “store first and ask questions later” paying off? If this question can be answered, then I consider it enlightenment.

James M. Dey

Data Architect | Cloud Solutions | Driving Scalable Data Strategies

8y

Good article. I think it's great that the data world is getting an ever-increasing toolkit. Data integration, whether it's on the way in (schema-on-write) or on the way out (schema-on-read), and data quality remain the biggest challenges. There seems to have been great enthusiasm in the last few years to simply build a big data cluster and then have data scientists identify patterns in the data, without perhaps considering whether those insights will add to the bottom line, or how the quality of the data might distort the value of those insights. I suspect that may have led to the trough of disillusionment mentioned in the article.

My take on the comparative features and use cases of traditional and big data technologies for reporting and analytics:

Operational Data Store (ODS):
- Completeness: Data is typically confined to a particular business function, e.g. sales, marketing, finance.
- Development time: Relatively quick, since data is typically not enriched, or only slightly enriched.
- Scalability: Limited; usually an ODS sits on a single server.
- Usability: An ODS is typically queried via SQL, which is understood by a lot of end users, and reporting tools can interact with it easily.
- Load latency: Can be near real time, if effectively a replication of a source system, or batched, typically with loads within minutes or hourly from source.
- Query latency: Low.
- Use case: Operational reporting, e.g. sales in the last x minutes.

Data warehouse:
- Completeness: All data related to one or more business functions, e.g. sales, marketing, finance, is made available for querying.
- Development time: It takes a long time to model and work out how to integrate all the data, only a small amount of which may ever be queried.
- Scalability: MPP data warehouses can be spread across a cluster, but performance degrades as each node is added.
- Usability: Same as an ODS.
- Load latency: Loads into data warehouses are typically overnight jobs.
- Query latency: Low.
- Use case: Analytical reporting, e.g. sales over a number of years.

Hadoop:
- Completeness: Typically more source data is captured than in a data warehouse, but it is not immediately queryable.
- Development time: Short to capture the data and to focus on specific use cases. However, cleansing data to make it possible to integrate disparate sources is still a challenge.
- Scalability: Data can be spread across a large number of nodes in a cluster.
- Usability: Originally this required a programmer to write MapReduce jobs in order to query the data. Later tools such as Hive and Drill allowed SQL scripts to be written (which are then converted into MapReduce jobs).
- Load latency: Batch processing, typically overnight.
- Query latency: High.
- Use case: Extracting information from large volumes of data, e.g. parsing web pages.

Spark: Similar to Hadoop, but data is held in memory (where it fits) and paged out to disk across the cluster. Data can be processed in mini-batches, reducing load latency, and Spark SQL is faster than Hive or Drill. Use case: similar to Hadoop, but where you need to process data more quickly, e.g. detecting interest in a product on a typical website and recommending a product.

Flink: The new kid on the big data block, similar in concept to Spark but able to deal with streamed data with very low load latency, e.g. detecting interest in a product on a high-traffic website and recommending a product.

Ben Clark

✔ Roadmaps that everyone can understand ✅ Fixing organisations whose growth has outstripped their operating model, process and fragmented applications

9y

Excellent article, Paul

Tony C.

Manager at Clarity Recruitment Consultants

9y

Worth reading

Aki Matsushima

Data Science, Data Visualisation Design

9y

Great post, by the way. I am excited about our Hadoop journey :)
