HDFS - Read to Analyze
HDFS CLI
The HDFS CLI (Command-Line Interface) provides a way to interact with the Hadoop Distributed File System (HDFS) using simple commands. You can perform file operations, directory management, permissions handling, and data transfers using the hdfs dfs command.
Basic commands for analyzing data:
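A few commonly used commands for inspecting data (the /data/sales path below is illustrative):

hdfs dfs -ls /data/sales               # list files in a directory
hdfs dfs -cat /data/sales/part-00000   # print a file's contents
hdfs dfs -tail /data/sales/part-00000  # show the last kilobyte of a file
hdfs dfs -du -h /data/sales            # show directory size in human-readable form
hdfs dfs -count /data/sales            # count directories, files, and bytes
hdfs dfs -get /data/sales/part-00000 . # copy a file to the local filesystem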
Hive
Apache Hive provides an SQL-like interface for processing structured and semi-structured data stored in HDFS. It abstracts the underlying execution engine (MapReduce, Tez, or Spark), making it easier for analysts to query and transform large datasets.
Reading raw files:
Hive supports reading raw files from HDFS and processing them on the fly.
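As a sketch, an external table can be defined directly over a directory of raw delimited files and queried immediately (the table name, columns, and path are assumptions):

CREATE EXTERNAL TABLE sales_raw (
  order_id INT,
  amount DOUBLE,
  order_date STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/sales';

-- Hive parses the raw files at query time; no data is copied or converted
SELECT order_date, SUM(amount) FROM sales_raw GROUP BY order_date;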
Querying data with tables:
Hive supports two types of tables: managed (internal) tables, where Hive owns the data and dropping the table deletes the underlying files, and external tables, where the data stays in place in HDFS and dropping the table removes only the metadata.
When creating a table, the main syntactic difference is the EXTERNAL keyword and the LOCATION clause, as illustrated below.
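A minimal illustration of the difference (names and path are assumptions):

-- Managed table: Hive stores the data under its warehouse directory
CREATE TABLE sales_managed (order_id INT, amount DOUBLE);

-- External table: the data stays at the given HDFS location
CREATE EXTERNAL TABLE sales_ext (order_id INT, amount DOUBLE)
LOCATION '/data/sales';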
Spark
Apache Spark provides multiple ways to query data stored in HDFS directly, using Spark SQL, DataFrames, and RDDs. Unlike Hive, Spark processes data in memory, which makes it much faster for analytics and transformations and well suited to large-scale data processing.
For interactive analysis, the Spark shell (spark-shell, Scala) and the PySpark shell (pyspark) are convenient entry points.
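A minimal PySpark sketch, assuming raw CSV files under /data/sales with a header row (the path and column names are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-analysis").getOrCreate()

# Read raw CSV files straight from HDFS into a DataFrame
df = spark.read.option("header", "true").csv("hdfs:///data/sales")

# Register the DataFrame as a view and query it with Spark SQL
df.createOrReplaceTempView("sales")
spark.sql("SELECT order_date, SUM(amount) AS total FROM sales GROUP BY order_date").show()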
Impala
Impala shares the Hive Metastore, so the table DDL is the same as shown in the Hive section; with external tables, the data remains in HDFS and Impala queries it in place.
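As a sketch, an existing Metastore table can be queried from the command line (the table name is an assumption):

impala-shell -q "SELECT order_date, SUM(amount) FROM sales_ext GROUP BY order_date"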
Presto/Trino
Presto (now Trino) is a high-performance, distributed SQL query engine optimized for fast interactive queries on large datasets stored in HDFS, S3, Hive, and other sources. It is designed for low-latency analytics, unlike Hive, which is optimized for batch processing.
Presto does not store metadata but relies on the Hive Metastore to understand table schemas and data locations in HDFS.
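For example, with a Hive connector catalog named hive configured (the schema and table names below are assumptions), data in HDFS can be queried in place:

SELECT order_date, SUM(amount) AS total
FROM hive.default.sales_ext
GROUP BY order_date;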
HBase
Unlike Hive or Impala, which use SQL for batch and interactive queries, HBase is optimized for real-time, random read/write access.
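For a quick look at the data from the HBase shell (the table name and row key are illustrative):

hbase shell
get 'sales', 'order#1001'      # fetch a single row by its key
scan 'sales', {LIMIT => 10}    # scan the first 10 rows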
What is the Hive Metastore?
The Hive Metastore (HMS) is a central metadata repository that stores table definitions, schema details, partitions, and storage locations for Hive tables. It acts as a bridge between structured data in HDFS and SQL-based query engines like Hive, Spark, Presto, and Impala.
Programming
Spark SQL APIs: Java, Scala, PySpark
Python: the hdfs module (a WebHDFS client) to read HDFS files
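A minimal sketch using the hdfs package (HdfsCLI), which talks to the NameNode over WebHDFS; the host, port, user, and path are assumptions:

from hdfs import InsecureClient

# WebHDFS endpoint of the NameNode (9870 is the default HTTP port in Hadoop 3)
client = InsecureClient('http://namenode-host:9870', user='analyst')

# Read a file's contents as text
with client.read('/data/sales/part-00000', encoding='utf-8') as reader:
    content = reader.read()
print(content[:500])  # first 500 characters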
Java: the Hadoop FileSystem API to read HDFS files
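A comparable sketch with the Hadoop FileSystem API (the path is illustrative; core-site.xml and hdfs-site.xml are assumed to be on the classpath):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // picks up cluster settings from the classpath
        FileSystem fs = FileSystem.get(conf);
        // fs.open returns an input stream positioned at the start of the file
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/data/sales/part-00000"))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}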