🚀 Day 10 Learning: Diving into Hadoop Ecosystem
Date: 2025‑05‑08
1. Introduction
On Day 10, I started unpacking the Hadoop framework, focusing on its core split between storage and processing: HDFS and MapReduce. I also skimmed complementary ecosystem tools such as Hive, Pig, Sqoop, and Oozie, and reviewed the major commercial distributions (Cloudera, Hortonworks, MapR, IBM BigInsights, Microsoft Azure HDInsight, AWS EMR, Databricks). 🌐📦
2. Topics Covered
🔹 Hadoop Core
• HDFS vs. MapReduce roles
• Architecture overview (NameNode, DataNode, ResourceManager): quick daemon check below
🔹 Ecosystem Tools
• Hive (SQL-on-Hadoop)
• Pig (data-flow scripting)
• Sqoop (RDBMS import/export)
• Oozie (workflow scheduler)
🔹 Commercial Distributions
• Cloudera, Hortonworks, MapR
• IBM BigInsights, Azure HDInsight, AWS EMR, Databricks
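Before running anything below, it helps to confirm the core daemons are actually up. A minimal check, assuming a pseudo-distributed install with HDFS and YARN already started:

jps                   # should list NameNode, DataNode, ResourceManager, NodeManager
hdfs dfsadmin -report # DataNode capacity and health as seen by the NameNode
yarn node -list       # NodeManagers registered with the ResourceManager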
3. Code Window – Core & Ecosystem Commands
# HDFS basics: list the filesystem root (verify storage is reachable)
hdfs dfs -ls /
# MapReduce Streaming WordCount (jar path varies by distribution; the
# -output directory must not already exist)
yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -files mapper.py \
  -input /user/dineshsj/input \
  -output /user/dineshsj/output \
  -mapper mapper.py \
  -reducer "uniq -c"
# mapper.py (shipped via -files) is assumed to be executable with a shebang
# and to emit one word per line; "uniq -c" works as the reducer because
# each reducer receives its keys in sorted order
-- Hive: Create table and run simple query
CREATE EXTERNAL TABLE IF NOT EXISTS sales(
id INT, amount DOUBLE, category STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/dineshsj/sales';
SELECT category, AVG(amount) AS avg_amt
FROM sales
GROUP BY category;
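# Run the HiveQL above from the shell (a sketch, assuming a local HiveServer2
# on the default port and the statements saved as avg_by_cat.hql)
beeline -u jdbc:hive2://localhost:10000 -f avg_by_cat.hql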
-- Pig Latin: Load data and filter
raw = LOAD 'sales.csv' USING PigStorage(',') AS (id:int, amount:double, category:chararray);
filtered = FILTER raw BY amount > 100.0;
result = GROUP filtered BY category;
avg_amt = FOREACH result GENERATE group, AVG(filtered.amount);
DUMP avg_amt;
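# Execute the Pig Latin above (assuming it is saved as avg_by_cat.pig;
# use -x local to test against local files before going to the cluster)
pig -x mapreduce avg_by_cat.pig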
# Sqoop: Import RDBMS table into HDFS (add -m 1 or --split-by if the table
# has no primary key; prefer -P over --password in real use)
sqoop import \
--connect jdbc:mysql://localhost/salesdb \
--username user --password pass \
--table sales \
--target-dir /user/dineshsj/sales_sql
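# Oozie (listed in the topics, no snippet yet): a minimal sketch of submitting
# a workflow, assuming a job.properties that points to a workflow.xml already
# uploaded to HDFS and an Oozie server on its default port
oozie job -oozie http://localhost:11000/oozie -config job.properties -run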
4. Key Takeaways
✅ Hadoop cleanly separates storage (HDFS) from processing (MapReduce/YARN).
✅ Hive and Pig provide higher‑level abstractions for SQL and scripting on big data.
✅ Sqoop simplifies moving data between Hadoop and relational databases.
✅ Major vendors (Cloudera, Hortonworks, MapR, IBM, Microsoft, AWS, Databricks) package Hadoop with enterprise tools, security, and support; note that Hortonworks has since merged into Cloudera and MapR is now part of HPE.
✅ Understanding the ecosystem helps choose the right tool for ETL, querying, and workflows.
5. Tomorrow’s Goals
6. References
7. Hashtags
#Day10 #Hadoop #BigData #HDFS #MapReduce #YARN #DataEngineering #100DaysOfCode #2025Learning #BigDataAnalytics #TechJourney #DataLake #OpenSource