Big Confusion about Big Data: Understand Terms like Hive, Hadoop, HBase, Sqoop...

Dear Folks!

People are often confused about what exactly the term “Big Data” means and what “Hadoop” is.

First, it’s important to define what Big Data is. I use Big Data to mean the data itself, although the term is often used interchangeably with the solutions that handle it (such as Hadoop). Big Data is not a technology but a collection of large volumes of structured or unstructured data. I believe that data should satisfy 3 criteria before being considered “Big Data”:

  • Volume – the amount of data has to be large, in petabytes not just gigabytes
  • Velocity – the data has to be frequent, daily or even real-time
  • Structure – the data is typically but not always unstructured (like videos, tweets, chats)

Now let's understand the terms we commonly see in a Big Data profile:

Hadoop: Apache Hadoop is an excellent framework for processing, storing and analyzing large volumes of unstructured data - aka Big Data.

Hadoop Distributed File System: HDFS, the storage layer of Hadoop, is a distributed, scalable, Java-based file system adept at storing large volumes of unstructured data.

Hive: Hive is a Hadoop-based, data-warehouse-like framework originally developed by Facebook. It allows users to write queries in a SQL-like language called HiveQL, which are then converted to MapReduce jobs.

MapReduce: MapReduce is a software framework that serves as the compute layer of Hadoop. MapReduce jobs are divided into two (obviously named) parts. The “Map” function divides a query into multiple parts and processes data at the node level. The “Reduce” function aggregates the results of the “Map” function to determine the “answer” to the query.
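
To make the two phases concrete, here is a minimal sketch in plain Python. It does not use Hadoop at all; it only imitates the idea on one machine with a word count. The function names (map_phase, shuffle, reduce_phase, word_count) and the sample text are made up for illustration, and each string in the input list stands in for the data held on one node.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group the emitted values by key, the step that sits between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    # Each chunk stands in for the data stored on one node (an assumption
    # of this sketch; a real cluster would run the maps in parallel).
    mapped = []
    for chunk in chunks:
        mapped.extend(map_phase(chunk))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    chunks = ["big data is big", "hadoop stores big data"]
    print(word_count(chunks))
```

The grouping step in the middle is what lets each “Reduce” call see every value for its key, no matter which chunk produced it.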

Flume: Flume is a framework for populating Hadoop with data. Agents are deployed throughout one's IT infrastructure (inside web servers, application servers and mobile devices, for example) to collect data and integrate it into Hadoop.

Sqoop: Sqoop is a connectivity tool for moving data from non-Hadoop data stores – such as relational databases and data warehouses – into Hadoop. It allows users to specify the target location inside of Hadoop and instruct Sqoop to move data from Oracle, Teradata or other relational databases to the target.

Mahout: Mahout is a data mining library. It takes the most popular data mining algorithms for clustering, regression and statistical modeling and implements them using the MapReduce model.

Thanks...

Hira Jha/-
