HDFS - Read to Analyze


HDFS CLI

The HDFS CLI (Command-Line Interface) provides a way to interact with the Hadoop Distributed File System (HDFS) using simple commands. You can perform file operations, directory management, permissions handling, and data transfers using the hdfs dfs command.

Basic commands for analyzing data:

Fig. HDFS Data Handling: View, Copy & Analyze Files Efficiently!
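As a rough illustration of those commands (-ls, -du, -cat, -tail, -get), the sketch below wraps them in Python via subprocess; the /data/sales directory and file names are placeholders, not from the original cluster.

    import subprocess

    # Hypothetical HDFS directory and file names, used for illustration only.
    HDFS_DIR = "/data/sales"

    def hdfs(*args):
        """Run an 'hdfs dfs' command and return its stdout as text."""
        result = subprocess.run(
            ["hdfs", "dfs", *args], capture_output=True, text=True, check=True
        )
        return result.stdout

    # List files and sizes in the directory.
    print(hdfs("-ls", HDFS_DIR))

    # Show space used by the directory, human-readable.
    print(hdfs("-du", "-h", HDFS_DIR))

    # Print a file's contents, or just its tail for a quick peek.
    print(hdfs("-cat", f"{HDFS_DIR}/part-00000.csv"))
    print(hdfs("-tail", f"{HDFS_DIR}/part-00000.csv"))

    # Copy a file to the local filesystem for deeper analysis.
    hdfs("-get", f"{HDFS_DIR}/part-00000.csv", "/tmp/part-00000.csv")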



Hive

Apache Hive provides an SQL-like interface for processing structured and semi-structured data stored in HDFS. It abstracts the underlying execution engine (MapReduce, Tez, or Spark), making it easier for analysts to query and transform large datasets.

Reading raw files:

Hive supports reading raw files from HDFS and processing them on the fly.

Fig. Hive reading raw files
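A minimal sketch of this, assuming a HiveServer2 endpoint reachable through the PyHive client; the host, table name, columns, and HDFS location are hypothetical.

    from pyhive import hive  # PyHive client for HiveServer2 (one common option)

    # Hypothetical connection details and HDFS location, for illustration only.
    conn = hive.connect(host="hiveserver2.example.com", port=10000, username="analyst")
    cur = conn.cursor()

    # Expose raw delimited text files already sitting in HDFS as a queryable table.
    cur.execute("""
        CREATE EXTERNAL TABLE IF NOT EXISTS raw_sales (
            order_id INT,
            amount   DOUBLE,
            order_dt STRING
        )
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        STORED AS TEXTFILE
        LOCATION '/data/raw/sales'
    """)

    # Query the raw files on the fly; nothing is copied or converted.
    cur.execute("SELECT order_dt, SUM(amount) FROM raw_sales GROUP BY order_dt")
    for row in cur.fetchall():
        print(row)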

Querying data with tables:

Hive allows two types of tables:

  1. Managed Tables – Data is managed and stored by Hive; the data is tightly coupled with the table.
  2. External Tables – Data remains in HDFS, and Hive manages only the metadata. Dropping the table deletes only the metadata from Hive; the underlying data stays unchanged in HDFS.

The main syntactic difference lies in the EXTERNAL keyword and the LOCATION clause in the CREATE TABLE statement.

Fig. Managed vs. External Tables: Spot the Syntax Twist! 🔍
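A small sketch of that syntax twist, again through a hypothetical PyHive session; the table names, file format, and location are illustrative only.

    from pyhive import hive  # same hypothetical HiveServer2 connection as above

    cur = hive.connect(host="hiveserver2.example.com", port=10000, username="analyst").cursor()

    # Managed table: Hive owns the data; DROP TABLE removes metadata AND files.
    cur.execute("""
        CREATE TABLE sales_managed (
            order_id INT,
            amount   DOUBLE
        )
        STORED AS ORC
    """)

    # External table: only metadata is registered; DROP TABLE leaves the HDFS files untouched.
    cur.execute("""
        CREATE EXTERNAL TABLE sales_external (
            order_id INT,
            amount   DOUBLE
        )
        STORED AS ORC
        LOCATION '/data/warehouse/sales_external'
    """)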



Spark

Apache Spark provides multiple ways to query data stored in HDFS directly, using Spark SQL, DataFrames, and RDDs. Unlike Hive, Spark can process data in memory, which makes it much faster for analytics and transformations and well suited to large-scale data processing.

For interactive analysis, the Spark shell (spark-shell) and the PySpark shell (pyspark) are useful tools.

Fig. Fast & Scalable HDFS Analytics with Apache Spark!
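A minimal PySpark sketch along these lines, assuming CSV files under a hypothetical hdfs:///data/raw/sales path with order_dt and amount columns.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hdfs-analytics").getOrCreate()

    # Read raw CSV files straight from HDFS into a DataFrame (schema inferred for brevity).
    sales = (spark.read
             .option("header", "true")
             .option("inferSchema", "true")
             .csv("hdfs:///data/raw/sales"))

    # DataFrame API: in-memory aggregation over the HDFS data.
    sales.groupBy("order_dt").sum("amount").show()

    # Spark SQL over the same data via a temporary view.
    sales.createOrReplaceTempView("sales")
    spark.sql("SELECT order_dt, SUM(amount) AS total FROM sales GROUP BY order_dt").show()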



Impala

Impala supports external tables only, so data remains in HDFS and the table syntax is the same as shown in the Hive section:

  • Impala is designed for fast queries, not for managing data storage.
  • Impala relies on Hive Metastore, which already supports external tables.
  • Impala’s architecture doesn’t support data movement or deletion.
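For completeness, a rough query sketch using the impyla Python client; the daemon host is an assumption, and the table is the hypothetical raw_sales external table from the Hive sketch above.

    from impala.dbapi import connect  # impyla client (one common option)

    # Hypothetical Impala daemon host; 21050 is the usual HiveServer2-protocol port.
    conn = connect(host="impalad.example.com", port=21050)
    cur = conn.cursor()

    # Pick up tables created through Hive (Impala reads the same Hive Metastore).
    cur.execute("INVALIDATE METADATA raw_sales")

    # Interactive query over the external table's files in HDFS.
    cur.execute("SELECT order_dt, SUM(amount) FROM raw_sales GROUP BY order_dt")
    for row in cur.fetchall():
        print(row)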



Presto/Trino

Presto (now Trino) is a high-performance, distributed SQL query engine optimized for fast interactive queries on large datasets stored in HDFS, S3, Hive, and other sources. It is designed for low-latency analytics, unlike Hive, which is optimized for batch processing.

Presto does not store metadata but relies on the Hive Metastore to understand table schemas and data locations in HDFS.

  • Presto depends on Hive Metastore for table metadata but does not use Hive for execution.
  • Presto can query existing Hive tables without data migration.
  • Presto is much faster than Hive for interactive queries because it uses in-memory execution.
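A small sketch using the Trino Python client (the trino package); the coordinator host, catalog, schema, and table name are hypothetical.

    import trino  # Trino Python client (one common option)

    # Hypothetical coordinator host; the hive catalog points at Hive Metastore tables.
    conn = trino.dbapi.connect(
        host="trino-coordinator.example.com",
        port=8080,
        user="analyst",
        catalog="hive",
        schema="default",
    )
    cur = conn.cursor()

    # Interactive, low-latency query against an existing Hive table; no data migration.
    cur.execute("SELECT order_dt, SUM(amount) FROM raw_sales GROUP BY order_dt")
    for row in cur.fetchall():
        print(row)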



HBase

Unlike Hive or Impala, which use SQL for batch and interactive queries, HBase is optimized for real-time, random read/write access.

  • HBase is best for real-time, random read/write workloads.
  • Does not support SQL natively but integrates with Hive for querying.
  • Works well for high-ingest applications like IoT, logs, and analytics.
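A minimal sketch of that access pattern using the happybase client, which talks to HBase through its Thrift server; the host, table, row keys, and column family are invented for illustration.

    import happybase  # Thrift-based HBase client (one common option)

    # Hypothetical Thrift server, table, row keys, and column family.
    connection = happybase.Connection("hbase-thrift.example.com")
    table = connection.table("sensor_readings")

    # Real-time random write: one row keyed by device id + timestamp.
    table.put(b"device42#20240101T120000", {b"m:temperature": b"21.7"})

    # Real-time random read of a single row by its key.
    print(table.row(b"device42#20240101T120000"))

    # Short range scan over one device's rows.
    for key, data in table.scan(row_prefix=b"device42#"):
        print(key, data)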



What is the Hive Metastore?

The Hive Metastore (HMS) is a central metadata repository that stores table definitions, schema details, partitions, and storage locations for Hive tables. It acts as a bridge between structured data in HDFS and SQL-based query engines like Hive, Spark, Presto, and Impala.
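As a small illustration of that bridge role, the PySpark sketch below enables Hive support so that a table registered in the Metastore (the hypothetical raw_sales table from earlier) can be queried by name without redefining its schema.

    from pyspark.sql import SparkSession

    # With Hive support enabled, Spark resolves table names through the Hive Metastore,
    # so a table defined in Hive (the hypothetical raw_sales) is queryable by name here.
    spark = (SparkSession.builder
             .appName("metastore-demo")
             .enableHiveSupport()
             .getOrCreate())

    spark.sql("SELECT order_dt, SUM(amount) AS total FROM raw_sales GROUP BY order_dt").show()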



Programming

  • Spark SQL APIs: Java, Scala, and PySpark
  • Python: the hdfs module to read HDFS files
  • Java: the Hadoop FileSystem API to read HDFS files
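A quick sketch of the Python option using the hdfs (WebHDFS) package; the NameNode address, user, and file path are placeholders.

    from hdfs import InsecureClient  # the 'hdfs' (WebHDFS) Python package

    # Hypothetical NameNode WebHDFS endpoint (9870 on Hadoop 3, 50070 on Hadoop 2).
    client = InsecureClient("http://namenode.example.com:9870", user="analyst")

    # List a directory and read a file straight from HDFS.
    print(client.list("/data/raw/sales"))
    with client.read("/data/raw/sales/part-00000.csv", encoding="utf-8") as reader:
        print(reader.read()[:500])  # first few hundred characters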



