File Format Identification For Hive Tables
Many of us, when we start working in the Hadoop ecosystem, use Hive as the data warehouse software for data summarisation, querying, and analysis. Right from the word go we are faced with the question of which file format to choose for our Hive tables, from a performance and compatibility perspective.
I would like to share a few results I came across while testing various file formats for Hive tables.
Environment used for the test:
Hadoop: Cloudera cluster with CDH version 5.9.0 (4-node cluster)
Hive version: Hive 1.1.0-cdh5.9.0
As part of this test, the file formats listed below were analysed:
1. TEXTFILE format
2. ORC file format
3. RCFILE format
4. PARQUET file format
5. SEQUENCEFILE format
6. AVRO file format
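As a rough sketch of how such tables are declared (the schema below is illustrative, not the exact test schema), each format is selected with the STORED AS clause:

    -- Illustrative DDL; column names and locations are assumptions, not the test schema.
    CREATE EXTERNAL TABLE transactions_text (
      custno   STRING,
      category STRING,
      amount   DOUBLE,
      location STRUCT<city:STRING, state:STRING>
    )
    ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      COLLECTION ITEMS TERMINATED BY '|'
    STORED AS TEXTFILE
    LOCATION '/data/transactions/text';

    -- The same schema can be stored in any of the other formats by changing the clause:
    --   STORED AS ORC | RCFILE | PARQUET | SEQUENCEFILE | AVRO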
As part of this test, the datasets below were used:
All the file formats used to store the tables were tested for the following performance parameters:
- Time taken to write data.
- Time taken for Reading or Querying the Data.
- CPU time taken for loading the data.
- Size of the table files created in each file format.
- Whether schema evolution is supported.
Read Performance
Below is a comparison of query times for tables (with 3 million records) stored in the different file formats, for varied query and column types (representative HiveQL is sketched after this list). The tables were tested for query time with:
• Where clause having one condition.
• Where clause having two conditions.
• Where clause having three conditions.
• GROUP BY on column ‘CATEGORY’ with an AGGREGATE function on column ‘AMOUNT’.
• GROUP BY on column ‘CUSTNO’ with an AGGREGATE function on column ‘AMOUNT’.
• JOIN between tables TRANSACTIONS and CUSTOMERS using CUSTNO as the join column (BUCKETS created, clustered by CUSTNO on both tables).
• JOIN between tables TRANSACTIONS and CUSTOMERS using CUSTNO as the join column (BUCKETS created, clustered by ‘STATE’ on the TRANSACTIONS table).
• Where clause on a STRUCT column location consisting of fields ‘CITY’ and ‘STATE’ in TRANSACTIONS table.
• DISTINCT keyword with STATE field of STRUCT column in TRANSACTIONS table.
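For reference, the test queries were along the lines of the sketch below; the predicates, literals, and the CUSTOMERS column names are assumptions, not the exact queries used.

    -- WHERE clause with one condition (two- and three-condition variants just add predicates)
    SELECT * FROM transactions WHERE category = 'GROCERY';

    -- GROUP BY with an aggregate on AMOUNT
    SELECT category, SUM(amount) FROM transactions GROUP BY category;

    -- Bucketed join on CUSTNO (both tables clustered by CUSTNO)
    SET hive.optimize.bucketmapjoin=true;
    SELECT t.custno, t.amount
    FROM transactions t
    JOIN customers c ON (t.custno = c.custno);

    -- Queries on fields of the STRUCT column 'location'
    SELECT * FROM transactions WHERE location.state = 'CA';
    SELECT DISTINCT location.state FROM transactions;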
Write Performance
Below is a table capturing key figures for loading data into tables stored in the different file formats.
As the table shows, Parquet and ORC optimise partial and full read performance, but they do so at the expense of write performance. Uncompressed files, by contrast, are fast to write because no compression work is done.
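A common way to capture these figures is to load the same staging data into each table and compare the timings Hive reports for the job; a minimal sketch, assuming a raw text staging table named transactions_text_staging:

    -- Load the staging data into the ORC-backed table; repeat for each format.
    INSERT OVERWRITE TABLE transactions_orc
    SELECT * FROM transactions_text_staging;

    -- Hive prints "Time taken" and "Total MapReduce CPU Time Spent" for the job;
    -- these are the wall-clock and CPU figures compared across formats.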
Storage Performance
The table below shows the file sizes of the external tables when stored in the different file formats.
PARQUET compresses data by default to roughly 1/10th of the uncompressed size. Benchmarks indicate that ORC files compress to the smallest of all file formats in Hadoop, without explicitly specifying any compression codec.
Sequence files and the Avro file format support limited compression with different codecs.
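The codec can also be chosen explicitly per table, and the resulting sizes checked on HDFS; a sketch with illustrative codec choices:

    -- ORC with its default ZLIB codec (SNAPPY or NONE are alternatives)
    CREATE TABLE transactions_orc (custno STRING, category STRING, amount DOUBLE)
    STORED AS ORC
    TBLPROPERTIES ("orc.compress" = "ZLIB");

    -- Parquet compressed with Snappy for subsequent writes
    SET parquet.compression=SNAPPY;

    -- Compare on-disk sizes of the table directories
    -- hdfs dfs -du -h /user/hive/warehouse/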
Observations
Benefits of choosing ORC:
- Considering all the parameters tested, the ORC file format comes out as the best format, for the reasons below.
- Best with respect to compression; it can save a lot of space if storage is a constraint.
- Good read performance, returning results quickly for most of the queries.
- Provides good write performance.
- ORC stores collections of rows in one file, and within each collection the row data is stored in a columnar format. This allows parallel processing of row collections across a cluster.
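Those row collections are ORC stripes, and their size can be tuned per table along with the codec; a sketch with illustrative values:

    CREATE TABLE transactions_orc_tuned (custno STRING, category STRING, amount DOUBLE)
    STORED AS ORC
    TBLPROPERTIES (
      "orc.compress"         = "ZLIB",       -- compression codec
      "orc.stripe.size"      = "67108864",   -- 64 MB stripes (the row collections)
      "orc.row.index.stride" = "10000"       -- index entry every 10,000 rows
    );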
Benefits of choosing PARQUET:
- Amongst the other column-oriented binary file formats, PARQUET is the better option.
- Parquet is good for queries scanning particular columns within a table, for example querying “wide” tables with many columns or performing aggregation operations like AVG() or SUM() for the values of a single column.
- PARQUET provides compression closer to what ORC file format achieves.
- Unlike RC and ORC files, Parquet SerDes also support limited schema evolution: in Parquet, new columns can be added at the end of the structure.
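For example, appending a column is a plain ALTER TABLE, and existing Parquet files simply return NULL for the new column on read (a sketch, assuming a Parquet-backed table named transactions_parquet):

    -- Add a new column at the end of the structure; older files return NULL for it.
    ALTER TABLE transactions_parquet ADD COLUMNS (channel STRING);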
Hopefully, these observations will help you decide which file format to use for the Hive tables in your project.
Below are a few notes to keep in mind before choosing any file format:
- DATE Datatype not supported in PARQUET.
- Avro files are supported in Hive 0.14.0 and later.
- Parquet columnar storage format supported in Hive 0.13.0 and later.
- For RC File: In order to add a column to your data you must rewrite every pre-existing RC file.
- In Hive, it is very important that Parquet column names are lowercase. If your Parquet file contains mixed-case column names, Hive will not be able to read those columns and will return null values for queries on them, without logging any errors.
- If you are storing intermediate data between MapReduce jobs, then Sequence files are preferred.
- Avro is great if your schema is going to change over time, but query performance will be slower than with ORC or Parquet (see the sketch after this list).
- CSV/TEXT files are excellent if you are going to extract data from Hadoop to bulk load into a database.
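On the Avro point above: an Avro-backed table takes its schema from a table property, so the schema can evolve by updating that property and giving new fields a default; a minimal sketch (the record and field names are assumptions):

    CREATE TABLE transactions_avro
    STORED AS AVRO
    TBLPROPERTIES ('avro.schema.literal' = '{
      "type": "record",
      "name": "Transaction",
      "fields": [
        {"name": "custno",  "type": "string"},
        {"name": "amount",  "type": "double"},
        {"name": "channel", "type": ["null", "string"], "default": null}
      ]
    }');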
Big Data Analytics & Machine Learning Platform
7yDoes the test consider other execution engines for hive ?