Why the Parquet file format is popular in big data storage, processing, and analytics: here is my research…


Parquet is a column-oriented storage format, whereas formats such as CSV and other flat files are row-oriented. Because the values of a single column are stored next to each other, columnar storage compresses very well and makes deduplication of repeated values effective.
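To make the row-oriented vs. column-oriented distinction concrete, here is a minimal pure-Python sketch (the table and its values are hypothetical, not from Parquet internals):

```python
# A hypothetical 3-row table with columns (name, count, code).
rows = [
    ("a1", 1, "x"),
    ("a2", 2, "y"),
    ("a3", 3, "z"),
]

# Row-oriented (CSV-like): all values of one record are stored together.
row_layout = [value for row in rows for value in row]
# -> ['a1', 1, 'x', 'a2', 2, 'y', 'a3', 3, 'z']

# Column-oriented (Parquet-like): all values of one column are stored together,
# so similar values sit adjacently and compress/deduplicate far better.
col_layout = [value for column in zip(*rows) for value in column]
# -> ['a1', 'a2', 'a3', 1, 2, 3, 'x', 'y', 'z']
```

In the columnar layout, a compressor sees long runs of same-typed, often-repeating values, which is what makes the encodings described below so effective.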

Parquet always stores data in an encoded, non-human-readable form. Most commonly it uses dictionary encoding, which is highly effective and very compressible: it builds a dictionary of the distinct values encountered in a given column and stores that dictionary in a dictionary page per column chunk. The actual values are then stored as integer indices that reference the dictionary entries, much like pointers.
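The idea can be sketched in a few lines of plain Python. This is a simplified illustration of the concept, not Parquet's actual on-disk encoding (which also applies bit-packing and run-length encoding to the indices); the column values are hypothetical:

```python
def dictionary_encode(values):
    """Split a column into a dictionary of distinct values plus integer indices."""
    dictionary = {}  # value -> index, in order of first appearance
    indices = []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        indices.append(dictionary[v])
    # The dictionary plays the role of the per-column-chunk dictionary page;
    # the indices play the role of the encoded data pages.
    return list(dictionary), indices

column = ["IN", "US", "IN", "IN", "UK", "US"]
dict_page, data_page = dictionary_encode(column)
# dict_page -> ['IN', 'US', 'UK']
# data_page -> [0, 1, 0, 0, 2, 1]
```

Note how each country string is stored once, and every repetition costs only a small integer; on low-cardinality columns this is where most of Parquet's size reduction comes from.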

Visual representation of how data is actually laid out in the Parquet storage format:


I wanted to understand this deeply and find out how it behaves in practice. The table below summarizes what I observed with different compression codecs for Parquet when I ran tests on approximately 50 GB of data in Azure Storage.

(Table image: comparison of Parquet compression codecs on the test dataset.)

My personal favorite is Snappy, since it was the fastest of the codecs I tested for big data processing, but you are free to choose whichever best suits your needs.
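The speed-versus-size trade-off behind that choice can be demonstrated with the Python standard library. Snappy itself is not in the stdlib, so this sketch uses zlib compression levels as a stand-in: level 1 approximates a fast, lighter-ratio codec like Snappy, while level 9 approximates a slower, denser codec like gzip. The sample column data is hypothetical:

```python
import time
import zlib

# A repetitive column serialized naively; real Parquet dictionary-encodes
# the column before compressing, which improves ratios further.
payload = ("IN,US,IN,IN,UK,US," * 100_000).encode()

for level in (1, 9):  # 1 ~ fast/Snappy-like, 9 ~ dense/gzip-like
    start = time.perf_counter()
    compressed = zlib.compress(payload, level)
    elapsed = time.perf_counter() - start
    ratio = len(payload) / len(compressed)
    print(f"level={level} size={len(compressed)} ratio={ratio:.1f}x time={elapsed:.4f}s")
```

On repetitive data both levels shrink the payload dramatically; the higher level squeezes out a better ratio at the cost of CPU time, which is exactly the trade-off you weigh when picking a Parquet codec for a big data pipeline.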

