Apache Drill Architecture
Apache Drill is a low-latency distributed query engine for large-scale datasets.
Drill is designed to scale to several thousand nodes and to query petabytes of data at the interactive speeds that BI/analytics environments require.
What is a Drillbit? It is the service that takes requests from clients, processes the queries, and returns the results.
Architecture: no master-slave configuration (ZooKeeper is the game changer).
A client can send a request to any node in the cluster. The query then flows through the cluster as follows:
- The Drill client issues a query. A Drill client can be JDBC, ODBC, the command-line interface, or the REST API. Any Drillbit in the cluster can accept queries from clients. There is no master-slave concept.
- The Drillbit then parses the query, optimizes it, and generates a distributed query plan that is optimized for fast and efficient execution.
- The Drillbit that accepts the query becomes the driving Drillbit node for the request. It gets a list of available Drillbit nodes in the cluster from ZooKeeper. The driving Drillbit determines the appropriate nodes to execute various query plan fragments to maximize data locality.
- The Drillbit schedules the execution of query fragments on individual nodes according to the execution plan.
- The individual nodes finish their execution and return data to the driving Drillbit.
- The driving Drillbit streams results back to the client.
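The node-selection step in the flow above can be illustrated with a toy sketch. This is not Drill's actual scheduler: the function name assign_fragments, the node names, and the least-loaded tie-breaking are assumptions made for illustration only. It shows the idea of preferring a live node that already holds a fragment's data and falling back to any live node otherwise.

```python
from collections import defaultdict

def assign_fragments(fragments, live_nodes):
    """Pick a node for each fragment, preferring nodes that hold its data.

    fragments:  dict of fragment id -> list of nodes holding that fragment's data
    live_nodes: list of available nodes (in Drill this list comes from ZooKeeper)
    Hypothetical helper, not part of Drill.
    """
    load = defaultdict(int)  # fragments assigned to each node so far
    plan = {}
    for frag_id, data_nodes in fragments.items():
        # Nodes that are both alive and local to the data
        local = [n for n in data_nodes if n in live_nodes]
        candidates = local or live_nodes
        # Least-loaded candidate; ties go to the first candidate in the list
        chosen = min(candidates, key=lambda n: load[n])
        load[chosen] += 1
        plan[frag_id] = chosen
    return plan

live = ["node1", "node2", "node3"]
fragments = {
    "scan-0": ["node1", "node2"],
    "scan-1": ["node1"],
    "scan-2": ["node4"],  # node4 is not alive, so any live node may be used
}
plan = assign_fragments(fragments, live)
print(plan)
```

Here "scan-0" and "scan-1" stay on nodes that hold their data, while "scan-2" loses locality because its only data node is unavailable.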
Core modules of a Drillbit:
RPC endpoint: Drill exposes a low-overhead protobuf-based RPC protocol to communicate with the clients.
SQL parser: Drill uses Optiq (now Apache Calcite), an open source framework, to parse incoming queries.
Optimizer: Drill uses standard database optimizations, both rule-based and cost-based, as well as data locality and other optimization rules exposed by the storage engine, to rewrite and split the query.
Execution engine: Drill provides an MPP (massively parallel processing) execution engine built to perform distributed query processing across the nodes in the cluster.
Storage plugins provide Drill with the following information:
• Metadata available in the source
• Interfaces for Drill to read from and write to data sources
• Location of data and a set of optimization rules to help with efficient and faster execution of Drill queries on a specific data source
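As a rough illustration of the three responsibilities listed above, here is a minimal Python sketch. Drill's real storage plugins are Java components, so every name here (StoragePluginSketch, InMemoryPlugin, and their methods) is hypothetical. The optimization rules a real plugin would contribute are left out to keep the sketch short.

```python
from abc import ABC, abstractmethod

class StoragePluginSketch(ABC):
    """Hypothetical interface mirroring what a storage plugin provides."""

    @abstractmethod
    def schema(self, table):
        """Metadata available in the source: column names of a table."""

    @abstractmethod
    def read(self, table, columns):
        """Read interface: yield rows restricted to the requested columns."""

    @abstractmethod
    def write(self, table, rows):
        """Write interface: append rows to a table."""

    @abstractmethod
    def locations(self, table):
        """Location of data: hosts holding the table, used for locality."""

class InMemoryPlugin(StoragePluginSketch):
    """Toy data source that keeps tables as lists of dicts on one host."""

    def __init__(self, host):
        self.host = host
        self.tables = {}

    def schema(self, table):
        rows = self.tables.get(table, [])
        return sorted(rows[0].keys()) if rows else []

    def read(self, table, columns):
        for row in self.tables.get(table, []):
            yield {c: row[c] for c in columns}

    def write(self, table, rows):
        self.tables.setdefault(table, []).extend(rows)

    def locations(self, table):
        return [self.host] if table in self.tables else []

plugin = InMemoryPlugin("node1")
plugin.write("users", [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}])
names = list(plugin.read("users", ["name"]))
print(plugin.schema("users"), plugin.locations("users"), names)
```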
Same as other in-memory databases:
Drill's integration with Hive is only for metadata. Drill does not invoke the Hive execution engine for any requests.
One good thing is the decentralized metastore: decentralized metadata also means that Drill is not tied to a single Hive repository. Users can query multiple Hive repositories in a single query and then combine the data with information from HBase tables or a file in the distributed file system.
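A query that spans two Hive repositories and a file could look like the sketch below, submitted through Drill's REST API (a POST to /query.json on the web port, 8047 by default). The host name, the storage plugin names hive_sales and hive_hr, and the table, column, and file names are all made up for illustration. The code only builds the request; the actual send is left commented out.

```python
import json
from urllib import request

def build_drill_request(host, sql, port=8047):
    """Build (but do not send) a POST request for Drill's REST query endpoint."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return request.Request(
        url=f"http://{host}:{port}/query.json",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Hypothetical plugins: two separate Hive metastores plus a file on dfs
sql = (
    "SELECT o.order_id, e.name, r.region_name "
    "FROM hive_sales.orders o "
    "JOIN hive_hr.employees e ON o.emp_id = e.emp_id "
    "JOIN dfs.`/data/regions.json` r ON o.region_id = r.region_id"
)

req = build_drill_request("drillbit1.example.com", sql)
print(req.get_method(), req.full_url)

# To run it against a live cluster (any Drillbit can accept the query):
# with request.urlopen(req) as resp:
#     print(json.load(resp))
```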
Distributed cache: Drill uses a distributed cache to manage metadata (not the data) and configuration information across the nodes.
Columnar execution: Drill optimizes for both columnar storage and execution by using an in-memory data model that is hierarchical and columnar. When working with data stored in columnar formats such as Parquet, Drill avoids disk access for columns that are not involved in an analytic query. Drill also provides an execution layer that performs SQL processing directly on columnar data without row materialization. The combination of optimizations for columnar storage and direct columnar execution significantly lowers memory footprints and provides faster execution of BI/Analytic type of workloads.
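The benefit of a columnar layout can be seen in a small sketch. This is plain Python, not Drill's in-memory format, and the helper names to_columns and sum_where as well as the sample data are assumptions for illustration. Once the data is held column by column, a filtered aggregate touches only the two columns it needs and never rebuilds full rows.

```python
# Row-oriented input: every record carries all of its fields
rows = [
    {"id": 1, "region": "EU", "amount": 10.0},
    {"id": 2, "region": "US", "amount": 25.5},
    {"id": 3, "region": "EU", "amount": 4.5},
]

def to_columns(rows):
    """Pivot a list of row dicts into a dict of column name -> list of values."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

def sum_where(columns, value_col, filter_col, wanted):
    """Sum value_col over positions where filter_col equals wanted.

    Only the two named columns are read; other columns (here "id")
    are never touched and no row objects are materialized.
    """
    return sum(
        v for v, f in zip(columns[value_col], columns[filter_col]) if f == wanted
    )

columns = to_columns(rows)
eu_total = sum_where(columns, "amount", "region", "EU")
print(eu_total)
```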