Why the Parquet file format is popular in big data storage, processing, and analytics: here is my research…


Parquet is a column-oriented storage format, whereas formats such as CSV and other flat files are row-oriented. Because the values of a single column are stored next to each other, columnar storage compresses very well and makes deduplication of repeated values effective.
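To make the row-oriented vs. column-oriented distinction concrete, here is a minimal pure-Python sketch (the table and its values are hypothetical, not from Parquet internals):

```python
# A hypothetical 3-row table with columns (name, count, code).
rows = [
    ("a1", 1, "x"),
    ("a2", 2, "y"),
    ("a3", 3, "z"),
]

# Row-oriented (CSV-like): all values of one record are stored together.
row_layout = [value for row in rows for value in row]
# -> ['a1', 1, 'x', 'a2', 2, 'y', 'a3', 3, 'z']

# Column-oriented (Parquet-like): all values of one column are stored together,
# so similar values sit adjacently and compress/deduplicate far better.
col_layout = [value for column in zip(*rows) for value in column]
# -> ['a1', 'a2', 'a3', 1, 2, 3, 'x', 'y', 'z']
```

In the columnar layout, a compressor sees long runs of same-typed, often-repeating values, which is what makes the encodings described below so effective.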

Parquet always stores data in an encoded, non-human-readable form. Most commonly it uses dictionary encoding, which is highly effective and very compressible: it builds a dictionary of the distinct values encountered in a given column and stores that dictionary in a dictionary page per column chunk. The actual values are then stored as integer indices that reference the dictionary entries, much like pointers.
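The idea can be sketched in a few lines of plain Python. This is a simplified illustration of the concept, not Parquet's actual on-disk encoding (which also applies bit-packing and run-length encoding to the indices); the column values are hypothetical:

```python
def dictionary_encode(values):
    """Split a column into a dictionary of distinct values plus integer indices."""
    dictionary = {}  # value -> index, in order of first appearance
    indices = []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        indices.append(dictionary[v])
    # The dictionary plays the role of the per-column-chunk dictionary page;
    # the indices play the role of the encoded data pages.
    return list(dictionary), indices

column = ["IN", "US", "IN", "IN", "UK", "US"]
dict_page, data_page = dictionary_encode(column)
# dict_page -> ['IN', 'US', 'UK']
# data_page -> [0, 1, 0, 0, 2, 1]
```

Note how each country string is stored once, and every repetition costs only a small integer; on low-cardinality columns this is where most of Parquet's size reduction comes from.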

Visual representation of how data is actually laid out in the Parquet storage format:


I wanted to understand this deeply and find out how it behaves in practice. The table below summarizes what I observed with different compression codecs for Parquet when I ran tests on approximately 50 GB of data in Azure Storage.

(Table image: comparison of Parquet compression codecs on the test dataset.)

My personal favorite is Snappy, since it was the fastest of the codecs I tested for big data processing, but you are free to choose whichever best suits your needs.
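The speed-versus-size trade-off behind that choice can be demonstrated with the Python standard library. Snappy itself is not in the stdlib, so this sketch uses zlib compression levels as a stand-in: level 1 approximates a fast, lighter-ratio codec like Snappy, while level 9 approximates a slower, denser codec like gzip. The sample column data is hypothetical:

```python
import time
import zlib

# A repetitive column serialized naively; real Parquet dictionary-encodes
# the column before compressing, which improves ratios further.
payload = ("IN,US,IN,IN,UK,US," * 100_000).encode()

for level in (1, 9):  # 1 ~ fast/Snappy-like, 9 ~ dense/gzip-like
    start = time.perf_counter()
    compressed = zlib.compress(payload, level)
    elapsed = time.perf_counter() - start
    ratio = len(payload) / len(compressed)
    print(f"level={level} size={len(compressed)} ratio={ratio:.1f}x time={elapsed:.4f}s")
```

On repetitive data both levels shrink the payload dramatically; the higher level squeezes out a better ratio at the cost of CPU time, which is exactly the trade-off you weigh when picking a Parquet codec for a big data pipeline.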

