How to Query Apache Hudi Tables with Python Using Daft: A Spark-Free Approach
Introduction:
Apache Hudi is a table format for managing large datasets on data lakes, with support for upserts and incremental processing. Hudi is most often used together with Apache Spark, but there are times when you want to query a Hudi table directly from Python without the overhead of starting a Spark session. In this blog post, we'll explore how to query Apache Hudi tables with Python using Daft, a DataFrame library that can read Hudi tables without relying on Spark.
What is Daft?
Daft is a distributed DataFrame library for Python whose query engine is implemented in Rust. Among its data source connectors is a reader for Apache Hudi tables, exposed as daft.read_hudi, which loads a table into a lazily evaluated DataFrame that you can filter, project, and aggregate without Apache Spark. This lets developers focus on their data processing logic and makes Hudi data accessible to a wider audience of Python developers.
Creating Sample Hudi Tables:
Before we delve into querying Hudi tables with Daft, let's set up a sample scenario. We'll create a Hudi table called "customers" containing customer data such as customer ID, name, city, email, and more. We'll use Python to generate synthetic customer data and write it to our Hudi table.
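The data generation step could look roughly like the sketch below. It uses only the Python standard library. The function name generate_customers, the sample name and city values, and the exact field names (customer_id, name, city, email) are assumptions made for illustration, since the original code is not shown here. Writing the rows to a Hudi table is a separate step that this sketch does not cover.

```python
import random
import uuid

# Hypothetical sample values; the original post does not list the ones it used.
FIRST_NAMES = ["Ava", "Liam", "Noah", "Mia", "Zoe"]
LAST_NAMES = ["Patel", "Garcia", "Kim", "Smith", "Nguyen"]
CITIES = ["Boston", "Austin", "Denver", "Seattle", "Chicago"]


def generate_customers(n, seed=42):
    """Return n synthetic customer records as a list of dicts."""
    rng = random.Random(seed)  # seeded so repeated runs give the same rows
    customers = []
    for _ in range(n):
        first = rng.choice(FIRST_NAMES)
        last = rng.choice(LAST_NAMES)
        customers.append(
            {
                # Build the UUID from the seeded generator to keep it reproducible.
                "customer_id": str(uuid.UUID(int=rng.getrandbits(128), version=4)),
                "name": f"{first} {last}",
                "city": rng.choice(CITIES),
                "email": f"{first}.{last}@example.com".lower(),
            }
        )
    return customers


if __name__ == "__main__":
    for row in generate_customers(3):
        print(row)
```

A list of dicts like this converts easily into whatever DataFrame type the writer expects before it is saved as the "customers" table.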
Read Hudi Tables using Python
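As a minimal sketch of the read path, the snippet below loads the "customers" table with daft.read_hudi and filters it by city. The helper name query_customers_by_city, the table path, and the column names are assumptions chosen to match the sample scenario above, not the original code. It assumes Daft is installed with its Hudi support.

```python
def query_customers_by_city(table_uri, city):
    """Read a Hudi table with Daft and return the customers from one city."""
    # Imported here so this module still loads where Daft is not installed.
    import daft

    # read_hudi is lazy: it builds a query plan without scanning data yet.
    df = daft.read_hudi(table_uri)

    # Column names are assumptions based on the sample "customers" table.
    return (
        df.where(daft.col("city") == city)
        .select("customer_id", "name", "city", "email")
    )


if __name__ == "__main__":
    # Hypothetical table location; replace with the base path of your Hudi table.
    result = query_customers_by_city("path/to/hudi/customers", "Boston")
    result.show()  # executes the plan and prints the first rows
```

Because the DataFrame is lazy, the filter and column selection are part of the plan before any files are read, and nothing is scanned until show() or collect() is called.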
Conclusion:
In this blog post, we've explored how to query Apache Hudi tables with Python using Daft, a Spark-free approach. With Daft's DataFrame interface, developers can read and query Hudi tables directly from Python, which removes the need to run a Spark cluster just to inspect or analyze a table. Whether you're building data pipelines, performing analytics, or preparing data for machine learning, Daft gives Python developers a lightweight way to work with data stored in Apache Hudi.