Apache Drill Architecture

Apache Drill is a low-latency distributed query engine for large-scale datasets.

Drill is designed to scale to thousands of nodes and query petabytes of data at the interactive speeds that BI/analytics environments require.

What is a Drillbit? It is the service that accepts requests from the client, processes the query, and returns the results.

Architecture: no master-slave configuration (ZooKeeper is the game changer)

A client sends a request to any one of the nodes; the Drillbit on that node parses the query, optimizes it, and generates a distributed query plan optimized for fast and efficient execution. In more detail:

  • The Drill client issues a query. A Drill client is a JDBC or ODBC application, the command-line interface, or a REST client. Any Drillbit in the cluster can accept queries from clients; there is no master-slave concept.
  • The Drillbit then parses the query, optimizes it, and generates a distributed query plan that is optimized for fast and efficient execution.
  • The Drillbit that accepts the query becomes the driving Drillbit node for the request. It gets a list of available Drillbit nodes in the cluster from ZooKeeper. The driving Drillbit determines the appropriate nodes to execute various query plan fragments to maximize data locality.
  • The Drillbit schedules the execution of query fragments on individual nodes according to the execution plan.
  • The individual nodes finish their execution and return data to the driving Drillbit.
  • The driving Drillbit streams results back to the client.
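The flow above can be sketched against Drill's REST API, which accepts SQL posted as JSON to any Drillbit. This is a minimal sketch: the host/port are placeholders for a local install, and the commented-out call assumes a running Drillbit.

```python
import json

# Any Drillbit can accept the query; host/port are placeholders for a local install.
DRILLBIT_URL = "http://localhost:8047"

def query_payload(sql: str) -> dict:
    """Build the JSON body that Drill's REST API expects at POST /query.json."""
    return {"queryType": "SQL", "query": sql}

payload = query_payload("SELECT full_name FROM cp.`employee.json` LIMIT 3")
body = json.dumps(payload)

# To actually execute (requires a running Drillbit):
#   import urllib.request
#   req = urllib.request.Request(DRILLBIT_URL + "/query.json",
#                                data=body.encode(),
#                                headers={"Content-Type": "application/json"})
#   rows = json.load(urllib.request.urlopen(req))["rows"]
```

Because any node accepts the request, the client needs no knowledge of which Drillbit will end up driving the query.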

Core modules of a Drillbit:

RPC endpoint: Drill exposes a low-overhead, protobuf-based RPC protocol to communicate with clients.

SQL parser: Drill uses Optiq (now Apache Calcite), the open-source SQL framework, to parse incoming queries.

Optimizer: Drill applies standard database optimizations, both rule-based and cost-based, as well as data locality and other optimization rules exposed by the storage engine, to rewrite and split the query.

Execution engine: Drill provides an MPP (massively parallel processing) execution engine built to perform distributed query processing across the nodes in the cluster.

Storage plugins provide Drill with the following information:

• Metadata available in the source
• Interfaces for Drill to read from and write to data sources
• Location of data and a set of optimization rules to help with efficient and faster execution of Drill queries on a specific data source
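As an illustration, a file-system storage plugin is just a JSON document that supplies exactly this information: the connection gives the data's location, and the workspaces and formats describe what Drill can read and write there. The plugin name `myfs` and the paths below are hypothetical.

```python
import json

# Hypothetical file-system plugin: tells Drill where the data lives,
# which workspaces exist, and which formats it can read and write.
plugin_config = {
    "type": "file",
    "connection": "file:///",
    "workspaces": {
        "tmp": {"location": "/tmp", "writable": True, "defaultInputFormat": "json"}
    },
    "formats": {
        "json": {"type": "json"},
        "parquet": {"type": "parquet"},
    },
}

# Registering it is one REST call against a running Drillbit, roughly:
#   POST http://localhost:8047/storage/myfs.json
registration = json.dumps({"name": "myfs", "config": plugin_config})
```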

As with other in-memory query engines, Drill's integration with Hive is for metadata only; Drill does not invoke the Hive execution engine for any requests.

A further benefit is the de-centralized metastore: Drill is not tied to a single Hive repository. Users can query multiple Hive repositories in a single query and then combine that data with HBase tables or a file in the distributed file system.
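For example, a single Drill query can join a Hive-managed table directly with a raw JSON file on the distributed file system. The schema, table, and file names below are hypothetical.

```python
# One query, two sources: a Hive table (via the hive storage plugin) and a
# plain file on the distributed file system (via the dfs plugin).
# The names hive.sales.orders and customers.json are made up for illustration.
federated_sql = """
SELECT o.order_id, c.name
FROM hive.sales.orders AS o
JOIN dfs.tmp.`customers.json` AS c
  ON o.cust_id = c.cust_id
"""
```

No data is copied into Drill beforehand; each storage plugin reads its source in place at query time.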

Distributed cache: Drill uses a distributed cache to manage metadata (not the data) and configuration information across the various nodes.

Columnar execution: Drill optimizes for both columnar storage and execution by using an in-memory data model that is hierarchical and columnar. When working with data stored in columnar formats such as Parquet, Drill avoids disk access for columns that are not involved in an analytic query. Drill also provides an execution layer that performs SQL processing directly on columnar data without row materialization. The combination of optimizations for columnar storage and direct columnar execution significantly lowers memory footprints and provides faster execution of BI/Analytic type of workloads.
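The columnar principle can be illustrated in plain Python: store each column as its own array, and a scan only ever touches the columns the query names. This is a toy sketch of the idea, not Drill's actual vectorized engine.

```python
# Columnar layout: one contiguous array per column. A query that names only
# "region" and "sales" never touches the "notes" column at all.
table = {
    "region": ["east", "west", "east"],
    "sales":  [100, 250, 175],
    "notes":  ["long free text...", "...", "..."],  # skipped entirely below
}

def columnar_scan(table, columns, predicate):
    """Materialize rows from only the requested columns, then filter."""
    rows = [dict(zip(columns, values))
            for values in zip(*(table[c] for c in columns))]
    return [row for row in rows if predicate(row)]

east_sales = columnar_scan(table, ["region", "sales"],
                           lambda row: row["region"] == "east")
# east_sales == [{'region': 'east', 'sales': 100}, {'region': 'east', 'sales': 175}]
```

In a real columnar format such as Parquet, the skipped column simply stays on disk, which is where the memory and I/O savings come from.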
