Why is the Parquet file format so popular in big data storage, processing, and analytics? Here is my research…
Parquet is a column-oriented storage format, whereas formats such as CSV and other flat files are row-oriented. Columnar storage compresses data very effectively and performs deduplication.
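To make the row vs. column distinction concrete, here is a toy illustration in plain Python (a sketch only, not Parquet's actual on-disk layout): the same two records laid out row-wise, the way CSV stores them, and column-wise, the way Parquet stores them.

```python
# Toy illustration: the same records stored row-wise vs column-wise.
# This is NOT Parquet's real layout, just the idea behind it.
rows = [
    {"id": 1, "city": "Pune"},
    {"id": 2, "city": "Delhi"},
]

# Row-oriented (like CSV): all values of one record sit together.
row_layout = [list(r.values()) for r in rows]

# Column-oriented (like Parquet): all values of one column sit together,
# so similar values are adjacent and compress/deduplicate well.
column_layout = {key: [r[key] for r in rows] for key in rows[0]}

print(row_layout)     # [[1, 'Pune'], [2, 'Delhi']]
print(column_layout)  # {'id': [1, 2], 'city': ['Pune', 'Delhi']}
```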
Parquet always stores data in an encoded form and is not human-readable. Most commonly it uses dictionary encoding, which is highly effective and compresses well. It builds a dictionary of the values encountered in a given column; that dictionary is stored in a dictionary page per column chunk, and the column values themselves are stored as integer indexes into the dictionary, much like pointers into the reference data.
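The idea can be sketched in a few lines of Python. This is a simplified model of dictionary encoding, not Parquet's actual implementation: unique column values go into a dictionary, and each row is replaced by a small integer index.

```python
# A minimal sketch of dictionary encoding (simplified; real Parquet
# stores the dictionary in a dictionary page per column chunk).
def dictionary_encode(column):
    dictionary = []   # unique values, in first-seen order
    index = {}        # value -> integer code
    codes = []        # the encoded column: one small int per row
    for value in column:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes

city = ["Pune", "Mumbai", "Pune", "Pune", "Delhi", "Mumbai"]
dictionary, codes = dictionary_encode(city)
print(dictionary)  # ['Pune', 'Mumbai', 'Delhi']
print(codes)       # [0, 1, 0, 0, 2, 1]
```

Notice that the repeated strings are stored only once; the per-row cost drops to a small integer, which is why repetitive columns shrink so dramatically.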
Visual representation of how actual data is stored in the Parquet file format:
I wanted to understand this deeply and tried to find out how it works in practice. The table below shows what I observed using different compression techniques for Parquet when I ran tests on approximately 50 GB of data on Azure storage.
My personal favorite is Snappy, since it is the fastest of them all for big data processing, but you are free to choose whichever best suits your needs.