Myth Busting: Breaking through the common misconceptions of "Big Data"
Foreword: This article forms part of a series seeking to demystify many of the terms and technologies used to manage data within the modern enterprise. The series will cover Big Data, Advanced Analytics, what "Digital" actually is, Machine Learning/Artificial Intelligence and Data Science. The intention of these articles is to simplify each of these concepts and differentiate them from one another, in the hope of explaining how each plays a part in the modern enterprise, and which elements you should adopt within your overall enterprise data strategy.
Big Data... chances are that if you're working in technology, or in a business heavily reliant on technology, this term will either conjure feelings of the silver-bullet solution being pitched by your technology vendor to solve all of your data problems, or conversely, feelings that it is nothing more than a myth - a marketing term that has cost millions in lost opportunity and time for the organisations you've seen attempt to implement it.
Much like religion, politics, sports teams or even which band wrote the greatest songs, "Big Data" is a highly polarising term, due largely to the subjective perception each individual has of what Big Data actually is.
In order to cut through these subjective perceptions, it's imperative we simplify things down to some base assumptions:
- Big Data refers to the ecosystem of products that center around Apache Hadoop, designed to store files of varying size, type and speed of arrival - or if you prefer, the three Vs: Volume, Variety and Velocity.
- Big Data solves particular data challenges within the modern enterprise, but not all data challenges. Hadoop by itself is not a replacement for a transactional data store or a data warehouse environment.
- Big Data is not, in itself, a reporting or advanced analytics tool. While there are several pieces within the Hadoop ecosystem that can support these tasks, the primary purpose of adopting Big Data should be to deal with high volume, high variety and high velocity data sets.
Setting these base assumptions within any Big Data project is essential. In almost every example, Big Data projects ultimately fail due to a misunderstanding between the parties involved about what they're actually purchasing, what new capabilities they will receive and what types of resources are required to utilise the new technology. With that in mind, let's look at the four main pillars of the Hadoop ecosystem.
The Hadoop Ecosystem
The Hadoop ecosystem is composed of many interchangeable parts, however in every case it is built on a technology known as the Hadoop Distributed File System - otherwise known as HDFS.
Storage
HDFS, in essence, is a storage layer that can store any type of information as file chunks known as "blocks" on a Hadoop cluster. These blocks are replicated across data nodes within the cluster and, because they can hold any type of information or file type, are highly versatile for dealing with large volumes of machine-generated information, images, video and audio. HDFS also supports compressing these files at the time they're ingested into the cluster, making it a far more cost-effective storage option for large files than an ordinary file system or a database system.
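As a rough illustration of what landing a file in HDFS looks like, the sketch below uses the third-party `hdfs` Python package (a WebHDFS client) to push a compressed clickstream log into the cluster. The namenode address, user and paths are placeholders invented for this example, not values from any particular environment.

```python
# Minimal sketch using the third-party `hdfs` WebHDFS client (pip install hdfs).
# The namenode URL, user name and paths below are illustrative placeholders.
from hdfs import InsecureClient

client = InsecureClient('http://namenode.example.com:9870', user='analyst')

# Upload a gzip-compressed clickstream log; HDFS splits it into blocks
# and replicates those blocks across the data nodes in the cluster.
client.upload('/data/raw/clickstream/2017-06-01.log.gz',
              '/var/log/web/clickstream-2017-06-01.log.gz')

# List what has landed in the raw zone so far.
print(client.list('/data/raw/clickstream'))
```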
Task Operations
In the earliest versions of Hadoop, HDFS was paired primarily with a processing technology known as MapReduce. MapReduce takes advantage of the distributed file system by efficiently distributing tasks across the cluster, each performing one of two functions:
Map: A Map task scans all of the blocks stored within HDFS to identify particular patterns of text, date-time stamps or other indicators we may want to filter out of the large volume of information we've captured into HDFS.
Reduce: A Reduce task aggregates and counts the number of times those particular patterns of text or date-time stamps occur. This allows us to understand how often something happened; for example, if we're analysing clickstream logs, we can identify how many times a particular user browsed a particular page, or how many times a particular page fired an event across every user browsing that page.
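To make the clickstream example concrete, here is a minimal sketch of a Hadoop Streaming style Map and Reduce pair written in Python. The log format (a tab-separated user id, page and event) and the file names are assumptions made purely for illustration.

```python
# mapper.py - a minimal Hadoop Streaming style Map task (illustrative only).
# Assumes tab-separated clickstream lines: user_id <TAB> page <TAB> event
import sys

for line in sys.stdin:
    fields = line.rstrip('\n').split('\t')
    if len(fields) == 3:
        user_id, page, event = fields
        # Emit a key/value pair for every page view we want to count.
        print(f'{page}\t1')
```

```python
# reducer.py - the matching Reduce task: aggregate the counts per page.
# Hadoop Streaming delivers the mapper output grouped and sorted by key.
import sys

current_page, count = None, 0
for line in sys.stdin:
    page, value = line.rstrip('\n').split('\t')
    if page != current_page:
        if current_page is not None:
            print(f'{current_page}\t{count}')
        current_page, count = page, 0
    count += int(value)

if current_page is not None:
    print(f'{current_page}\t{count}')
```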
Resource Management
In order for MapReduce to function, Hadoop (from version 2 onwards) provides a data operating system known as "Yet Another Resource Negotiator" - more commonly known as YARN. YARN gives MapReduce the ability to ask for resources on the cluster, and for those resources to be granted to any task according to permission or priority. Where previously MapReduce would simply consume all available resources on the cluster for each batch task it undertook, YARN allows the Hadoop ecosystem to work on multiple tasks at the same time and to choose which tasks take priority. YARN, in essence, becomes the centerpiece of the cluster.
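As a rough sketch of how this plays out in practice, the snippet below submits the earlier mapper.py/reducer.py as a streaming job against a named YARN queue so the scheduler can prioritise it against other workloads. The queue name 'analytics', the jar location and all paths are hypothetical placeholders; the mapreduce.job.queuename property is the standard way to target a queue.

```python
# Sketch: submitting the earlier streaming job to a specific YARN queue.
# The hadoop-streaming jar path, HDFS paths and the 'analytics' queue name
# are illustrative placeholders for this example.
import subprocess

subprocess.run([
    'hadoop', 'jar', '/usr/lib/hadoop-mapreduce/hadoop-streaming.jar',
    '-D', 'mapreduce.job.queuename=analytics',   # YARN queue to run in
    '-files', 'mapper.py,reducer.py',
    '-input', '/data/raw/clickstream',
    '-output', '/data/derived/page_counts',
    '-mapper', 'python3 mapper.py',
    '-reducer', 'python3 reducer.py',
], check=True)
```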
Scripting
Now we move on to the tools analysts use to create these tasks and to leverage the inbuilt capabilities of HDFS, MapReduce and YARN. There are two scripting languages within a baseline build of Hadoop: Pig and Hive.
Pig enables users to build complex MapReduce tasks using a simplified programming language known as Pig Latin. This scripting language is highly extensible and easy to program, making it powerful for iterative data processing, research and exploration across large data files, and for building pipelines that perform extract, transform and load tasks within the HDFS environment.
Hive, in essence, is the "data warehouse" layer of HDFS, in that it allows you to build a tabular structure over the top of the files already sitting in Hadoop and then use SQL syntax to interrogate those files. Hive is highly functional and easily picked up by team members with existing SQL skills, however it does not have the same level of flexibility as Pig.
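To illustrate the idea, the sketch below projects a Hive table over the clickstream files already sitting in HDFS and queries it with ordinary SQL, driven from Python via the third-party PyHive client. The host, table layout and paths are assumptions made for this example only.

```python
# Sketch: overlaying a tabular structure on existing HDFS files with Hive,
# then querying it with SQL. Uses the third-party PyHive client
# (pip install pyhive); host, schema and paths are illustrative only.
from pyhive import hive

conn = hive.connect(host='hive-server.example.com', port=10000, username='analyst')
cursor = conn.cursor()

# An external table simply overlays a schema on files already in HDFS.
cursor.execute(r"""
    CREATE EXTERNAL TABLE IF NOT EXISTS clickstream (
        user_id STRING,
        page    STRING,
        event   STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/raw/clickstream'
""")

# Familiar SQL, executed as distributed jobs over the underlying blocks.
cursor.execute("""
    SELECT page, COUNT(*) AS views
    FROM clickstream
    GROUP BY page
    ORDER BY views DESC
    LIMIT 10
""")
for page, views in cursor.fetchall():
    print(page, views)
```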
Between HDFS, MapReduce and the two scripting languages, an organisation now has the ability to take high-volume, high-complexity information such as sensor traffic, clickstream events from a server, or maintenance events from production and manufacturing lines, store it efficiently, access it and provide its analysts with the tools to interrogate it.
To summarise:
Hadoop isn't a magical replacement for data warehousing within your business, although it can play a role in your archiving strategy - especially in lowering the cost of storing long-standing data while still keeping that information accessible. This means your enterprise data warehouse (EDW) can focus solely on what is most likely to be queried, rather than also storing seven years' worth of archival information.
Hadoop also isn't a magical replacement for your statistical analysis and modelling capabilities. While it does include some of these capabilities through toolsets such as Spark and Mahout, it is still ultimately slower than a high-performing analytical database. What Hadoop offers is the ability to analyse different sets of information, and to use its tools to distill the data down to the particular patterns and events you want to use for feature selection in your predictive models.
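As a hedged illustration of that workflow, the sketch below uses PySpark (which ships alongside many Hadoop distributions) to distill raw clickstream events in HDFS down to a few per-user features that could feed a predictive model built elsewhere. The paths, column names and event values are assumptions for this example.

```python
# Sketch: using Spark on the cluster to distill raw events into per-user
# features for a model trained elsewhere. Paths, columns and the 'purchase'
# event value are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName('clickstream-features').getOrCreate()

# Read the raw, variable clickstream files straight out of HDFS.
events = spark.read.json('hdfs:///data/raw/clickstream')

# Distill a very large number of events down to a handful of per-user features.
features = (
    events.groupBy('user_id')
          .agg(F.count(F.lit(1)).alias('total_events'),
               F.countDistinct('page').alias('distinct_pages'),
               F.sum(F.when(F.col('event') == 'purchase', 1).otherwise(0))
                .alias('purchases'))
)

# Hand the compact feature set to the analytical environment as Parquet.
features.write.mode('overwrite').parquet('hdfs:///data/features/user_clickstream')
```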
Hadoop brings a clear capability in providing the tools and technology to manage large volumes of information, to archive information cost effectively and to help you find the needle in the haystack. If you're talking about a million rows of new structured data per month, Hadoop is not the right option for that use case. If you're talking about heavily unstructured, variable information in the order of a billion events per month, then a properly configured Hadoop cluster is essential to returning business value from those use cases.
In all scenarios, have a clear understanding of the problem you're trying to solve, and a clear understanding of the business outcomes of solving it. In many cases, companies fail in their "Big Data" strategies solely because they didn't have a Big Data problem to solve.