🚀 Day 10 Learning: Diving into Hadoop Ecosystem

Date: 2025‑05‑08


1. Introduction

On Day 10, I started unpacking the Hadoop framework, focusing on its core split between storage and processing: HDFS and MapReduce. I also skimmed complementary ecosystem tools such as Hive, Pig, and Sqoop, and reviewed the major commercial distributions (Cloudera, Hortonworks, MapR, IBM, Microsoft Azure HDInsight, AWS EMR, Databricks). 🌐📦


2. Topics Covered

🔹 Hadoop Core
  • HDFS vs. MapReduce roles
  • Architecture overview (NameNode, DataNode, ResourceManager)

🔹 Ecosystem Tools
  • Hive (SQL-on-Hadoop)
  • Pig (data flow scripting)
  • Sqoop (RDBMS import/export)
  • Oozie (workflow scheduler)

🔹 Commercial Distributions
  • Cloudera, Hortonworks, MapR
  • IBM BigInsights, Azure HDInsight, AWS EMR, Databricks
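
To make the architecture overview above concrete, two standard commands report on those daemons once a cluster is up (a quick sketch, assuming HDFS and YARN are already running):

# Report NameNode capacity and the DataNodes it knows about
hdfs dfsadmin -report

# List the NodeManagers registered with the ResourceManager
yarn node -list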


3. Code Window – Core & Ecosystem Commands

# HDFS basics: verify storage by listing the root directory
hdfs dfs -ls /
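
The streaming job below reads from /user/dineshsj/input, so that directory needs to exist and hold some data first; a quick sketch of staging a local file there (data.txt is just a placeholder name):

# Create the input directory and upload a local file
hdfs dfs -mkdir -p /user/dineshsj/input
hdfs dfs -put data.txt /user/dineshsj/input/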
        
# MapReduce Streaming WordCount
# (-files ships mapper.py to the task nodes; the output directory must not already exist)
yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -files mapper.py \
  -input /user/dineshsj/input \
  -output /user/dineshsj/output \
  -mapper mapper.py \
  -reducer "uniq -c"
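
The streaming job above expects a mapper.py in the working directory. A minimal sketch (my assumption: plain whitespace tokenisation, emitting one word per line so the sorted stream can be counted by "uniq -c"):

# Create a throwaway mapper.py for the WordCount job above
cat > mapper.py <<'EOF'
#!/usr/bin/env python3
import sys
# Emit one word per line; the shuffle phase sorts them before the reducer
for line in sys.stdin:
    for word in line.strip().split():
        print(word)
EOF
chmod +x mapper.py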
        
-- Hive: Create table and run simple query
CREATE EXTERNAL TABLE IF NOT EXISTS sales(
  id INT, amount DOUBLE, category STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/dineshsj/sales';

SELECT category, AVG(amount) AS avg_amt
FROM sales
GROUP BY category;
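
The HiveQL above can be saved to a script and run non-interactively; sales.hql is just an example file name, and the Beeline variant assumes a HiveServer2 running locally on the default port:

# Run the HiveQL script with the Hive CLI
hive -f sales.hql

# Or through Beeline against a local HiveServer2 (default port 10000)
beeline -u jdbc:hive2://localhost:10000 -f sales.hql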
        
-- Pig Latin: Load data and filter
raw = LOAD 'sales.csv' USING PigStorage(',') AS (id:int, amount:double, category:chararray);
filtered = FILTER raw BY amount > 100.0;
result = GROUP filtered BY category;
avg_amt = FOREACH result GENERATE group, AVG(filtered.amount);
DUMP avg_amt;
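
To execute the Pig Latin above, save it to a script (sales_avg.pig is a made-up name) and pick an execution mode:

# Local mode: reads sales.csv from the local filesystem, handy for testing
pig -x local sales_avg.pig

# MapReduce mode: runs on the cluster and reads sales.csv from HDFS
pig -x mapreduce sales_avg.pig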
        
# Sqoop: Import RDBMS table into HDFS
sqoop import \
  --connect jdbc:mysql://localhost/salesdb \
  --username user --password pass \
  --table sales \
  --target-dir /user/dineshsj/sales_sql
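
A quick way to confirm the import landed where expected (Sqoop writes one part file per map task, four by default):

# List and peek at the imported files
hdfs dfs -ls /user/dineshsj/sales_sql
hdfs dfs -cat /user/dineshsj/sales_sql/part-m-00000 | head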
        

4. Key Takeaways

✅ Hadoop cleanly separates storage (HDFS) from processing (MapReduce/YARN).
✅ Hive and Pig provide higher‑level abstractions for SQL and scripting on big data.
✅ Sqoop simplifies moving data between Hadoop and relational databases.
✅ Major vendors (Cloudera, Hortonworks, MapR, IBM, Microsoft, AWS, Databricks) package Hadoop with enterprise tools, security, and support.
✅ Understanding the ecosystem helps choose the right tool for ETL, querying, and workflows.


5. Tomorrow’s Goals

  • Set up a single‑node Hadoop cluster in pseudo‑distributed mode
  • Run basic HDFS commands to verify file operations
  • Execute a sample MapReduce WordCount job on the new cluster
  • Explore Hive CLI to run simple SQL queries on HDFS data
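
As a rough preview of the first goal, the usual single-node startup sequence looks like this (assuming a standard Apache Hadoop install with the sbin scripts on the PATH and core-site.xml/hdfs-site.xml already pointed at localhost):

# One-time: format the NameNode metadata directory
hdfs namenode -format

# Start HDFS (NameNode + DataNode) and YARN (ResourceManager + NodeManager)
start-dfs.sh
start-yarn.sh

# Confirm the daemons are up
jps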


6. Hashtags

#Day10 #Hadoop #BigData #HDFS #MapReduce #YARN #DataEngineering #100DaysOfCode #2025Learning #BigDataAnalytics #TechJourney #DataLake #OpenSource
