One Table, Two Engines: Building a Unified Lakehouse with Spark, Trino, and Apache Iceberg
In today’s hybrid data world, interoperability is no longer a luxury—it’s a necessity. This blog dives into how you can seamlessly bridge Spark and Trino with Apache Iceberg for a unified, future-proof data lakehouse.
What is Interoperability?
Interoperability in a data lake architecture means enabling multiple execution engines, like Apache Spark and Trino, to seamlessly access and operate on the same dataset. This capability eliminates redundant processing, ensures consistency, and unlocks diverse workloads in a unified environment.
In this blog, I’ll walk you through configuring Spark and Trino to share Apache Iceberg tables stored on S3, backed by a Hive metastore on AWS RDS. With this setup, you’ll achieve seamless read and write operations across both engines, powering efficient hybrid workflows.
If you’re new to Apache Iceberg or curious why it’s the future-proof solution for data lakes, explore my previous blog on Apache Iceberg.
Why Interoperability Matters
In modern data processing, hybrid workloads are essential. Spark shines in handling complex ETL workflows, while Trino is unparalleled in ad-hoc analytics and federated queries. By enabling both engines to access and modify the same Iceberg tables stored on S3, backed by a Hive metastore, you eliminate redundancy, reduce latency, and maintain consistency across all operations. This interoperability ensures your data architecture is both efficient and flexible.
Setting Up The Environment
To create a shared Iceberg environment for Spark and Trino, ensure the following prerequisites are met:
- An S3 bucket (or prefix) to hold the Iceberg tables' data and metadata files.
- A Hive metastore backed by an AWS RDS database, reachable from both engines.
- A Spark cluster (for example, on Amazon EMR) with the Iceberg Spark runtime jar available.
- A Trino cluster that can reach both the Hive metastore (over Thrift) and the S3 warehouse location.
These steps establish a unified metadata layer, ensuring Spark and Trino can seamlessly share Iceberg tables for diverse workloads.
Configure Apache Spark for Iceberg
Apache Iceberg transforms your data lake into a high-performance, open data lakehouse. To achieve this, Apache Spark needs to integrate seamlessly with Iceberg through a Hive metastore. Here’s how to set up Spark for Iceberg while leveraging a Hive metastore hosted on AWS RDS.
Step 1: Setting Up the Hive Metastore
The Hive metastore serves as the metadata layer for your Iceberg tables, acting like a centralized "lookup table." Each table name maps to the directory in S3 holding the files that represent its data. Here’s how to configure the Hive metastore:
"Classification": "hive-site",
"Properties": {
"javax.jdo.option.ConnectionDriverName": "<JDBC-DRIVER-OF-YOUR-RDS>",
"javax.jdo.option.ConnectionPassword": "<RDS-DB-PASSWORD>",
"javax.jdo.option.ConnectionURL": "<RDS-JDBC-CONNECTION-STRING>",
"javax.jdo.option.ConnectionUserName": "<RDS-DB-USERNAME>"
}
This configuration ensures the Hive metastore communicates with AWS RDS to store metadata.
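If you provision the cluster with Amazon EMR, this classification is supplied at cluster creation time. The sketch below uses boto3; the cluster name, release label, instance settings, and IAM roles are illustrative placeholders, not part of the original setup:

# Illustrative sketch: launching an EMR cluster with the hive-site classification via boto3.
# Cluster name, release label, instance settings, and IAM roles below are placeholders.
import boto3

emr = boto3.client("emr", region_name="<AWS-REGION>")

emr.run_job_flow(
    Name="iceberg-lakehouse",
    ReleaseLabel="emr-6.9.0",
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    Configurations=[
        {
            "Classification": "hive-site",
            "Properties": {
                "javax.jdo.option.ConnectionDriverName": "<JDBC-DRIVER-OF-YOUR-RDS>",
                "javax.jdo.option.ConnectionPassword": "<RDS-DB-PASSWORD>",
                "javax.jdo.option.ConnectionURL": "<RDS-JDBC-CONNECTION-STRING>",
                "javax.jdo.option.ConnectionUserName": "<RDS-DB-USERNAME>",
            },
        }
    ],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1}
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)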
Step 2: Configuring Spark to Use the Hive Metastore
Once the Hive metastore is ready, Spark must connect to it via the Thrift protocol. The Hive URI includes the IP or hostname of the EMR master node or your dedicated Hive metastore host:
"Classification": "spark-defaults",
"Properties": {
"spark.hadoop.hive.metastore.uris": "thrift://<EMR-IP-OR-MASTER-NODE-IP>:<HIVE-POrT-NUMBER>",
"spark.hadoop.javax.jdo.option.ConnectionDriverName": "<JDBC-DRIVER-OF-YOUR-RDS>",
"spark.hadoop.javax.jdo.option.ConnectionPassword": "<RDS-DB-PASSWORD>",
"spark.hadoop.javax.jdo.option.ConnectionURL": "<RDS-JDBC-CONNECTION-STRING>",
"spark.hadoop.javax.jdo.option.ConnectionUserName": "<RDS-DB-USERNAME>",
"spark.jars": "<LOCATION-OF-SPARK-ICEBERG-JAR-FILE>", //"iceberg-spark3-runtime.jar"
"spark.sql.catalog.iceberg": "org.apache.iceberg.spark.SparkCatalog",
"spark.sql.catalog.iceberg.catalog-impl": "org.apache.iceberg.hive.HiveCatalog",
"spark.sql.catalog.iceberg.warehouse": "<S3-DIRECTORY-TO-LOCATE-YOUR-TABLES-BY-DEFAULT>",
"spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
}
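Alternatively, if you don't manage spark-defaults directly, the same properties can be set when the SparkSession is created. This is only a sketch using the placeholder values above; adjust it to your environment:

# Sketch: applying the same catalog settings programmatically instead of via spark-defaults.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-interop")
    .config("spark.hadoop.hive.metastore.uris", "thrift://<EMR-IP-OR-MASTER-NODE-IP>:<HIVE-PORT-NUMBER>")
    .config("spark.sql.catalog.iceberg", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.iceberg.catalog-impl", "org.apache.iceberg.hive.HiveCatalog")
    .config("spark.sql.catalog.iceberg.warehouse", "<S3-DIRECTORY-TO-LOCATE-YOUR-TABLES-BY-DEFAULT>")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .enableHiveSupport()
    .getOrCreate()
)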
Step 3: Writing Data to Iceberg
Iceberg's robust support for transactions and schema evolution makes data writing seamless. Use the following code to write data into an Iceberg table:
df.writeTo("iceberg.<SCHEMA-NAME>.<TABLE-NAME>")
.using("iceberg")
.createOrReplace()
Note: The code above uses the DataFrameWriterV2 API available in Spark 3 and current Iceberg releases. If you're on an older Spark or Iceberg release, consult the Iceberg documentation for the compatible write syntax.
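To make the write step concrete, here is a minimal sketch (placeholder column, schema, and table names) that builds a small DataFrame, creates or replaces the table, and then appends to it:

# Minimal sketch with placeholder names; assumes the SparkSession configured above.
from pyspark.sql import Row

df = spark.createDataFrame([Row(id=1, name="alpha"), Row(id=2, name="beta")])

# Create the table in the Iceberg catalog, replacing it if it already exists.
df.writeTo("iceberg.<SCHEMA-NAME>.<TABLE-NAME>").using("iceberg").createOrReplace()

# Append additional rows to the existing table instead of replacing it.
df.writeTo("iceberg.<SCHEMA-NAME>.<TABLE-NAME>").append()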
Step 4: Reading Data from Iceberg
To read data from Iceberg tables in Spark, you can use either of these methods:
df = spark.table("iceberg.<SCHEMA-NAME>.<TABLE-NAME>")
df = spark.read.format("iceberg").load("<SCHEMA-NAME>.<TABLE-NAME>")
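Beyond the latest snapshot, Iceberg's read options also support time travel. Here is a small sketch; the timestamp and snapshot ID below are illustrative, and snapshot IDs for your table can be found in its snapshots metadata table:

# Time-travel reads: the timestamp and snapshot ID values are illustrative placeholders.
# "as-of-timestamp" is in milliseconds since the epoch; snapshot IDs are listed in
# iceberg.<SCHEMA-NAME>.<TABLE-NAME>.snapshots.
df_asof = (
    spark.read.format("iceberg")
    .option("as-of-timestamp", "1700000000000")
    .load("<SCHEMA-NAME>.<TABLE-NAME>")
)

df_snapshot = (
    spark.read.format("iceberg")
    .option("snapshot-id", 1234567890123456789)
    .load("<SCHEMA-NAME>.<TABLE-NAME>")
)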
By setting up Spark with Iceberg, you enable powerful data operations, such as time travel and incremental snapshots, while maintaining consistency and performance. Ready to supercharge your workflows? Let’s dive into configuring Trino next!
Configure Trino for Iceberg
Setting up Trino to work with Apache Iceberg is straightforward and enables powerful ad-hoc querying on shared tables. Here’s how to configure Trino to connect to Iceberg through a Hive metastore.
Step 1: Set Up the Hive Metastore
Just like with Spark, Trino requires a Hive metastore to manage table metadata. Use the following configuration for your Hive metastore, connecting it to your AWS RDS instance:
"Classification": "hive-site",
"Properties": {
"javax.jdo.option.ConnectionDriverName": "<JDBC-DRIVER-OF-YOUR-RDS>",
"javax.jdo.option.ConnectionPassword": "<RDS-DB-PASSWORD>",
"javax.jdo.option.ConnectionURL": "<RDS-JDBC-CONNECTION-STRING>",
"javax.jdo.option.ConnectionUserName": "<RDS-DB-USERNAME>"
}
This step ensures that Trino and Spark share the same metadata repository, enabling seamless access to Iceberg tables.
Step 2: Configure Trino Catalog for Iceberg
In Trino, catalogs are configured using a `catalog.properties` file. To set up Iceberg, create or modify the configuration file at `/etc/trino/conf/catalog/iceberg.properties` with the following:
connector.name=iceberg
hive.metastore.uri=thrift://<EMR-IP-OR-MASTER-NODE-IP>:<HIVE-PORT-NUMBER>
iceberg.catalog.type=HIVE_METASTORE
This configuration ensures that Trino queries the same metadata as Spark, providing a unified view of Iceberg tables.
Step 3: Query Iceberg Tables in Trino
Once the catalog is configured, querying Iceberg tables in Trino is as simple as writing standard SQL:
SELECT * FROM iceberg.<SCHEMA-NAME>.<TABLE-NAME>;
With this setup, Trino can perform fast, ad-hoc queries on Iceberg tables stored in S3, ensuring compatibility with Spark and leveraging the power of a shared Hive metastore.
By integrating Trino with Iceberg, you gain the ability to query large datasets effortlessly, while maintaining consistency across engines. Ready to unlock seamless data access? Start querying!
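A quick way to confirm the two engines really see the same table is to count rows from Trino right after a Spark write. The sketch below assumes the trino Python client (pip install trino); host, user, schema, and table names are placeholders:

# Hypothetical cross-engine check using the trino Python client; names are placeholders.
from trino.dbapi import connect

conn = connect(
    host="<TRINO-COORDINATOR-HOST>",
    port=8080,
    user="<YOUR-USER>",
    catalog="iceberg",
    schema="<SCHEMA-NAME>",
)

cur = conn.cursor()
cur.execute("SELECT count(*) FROM <TABLE-NAME>")
print(cur.fetchall())  # should match the row count written from Spark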
Benefits and Best Practices
Integrating Apache Spark and Trino with a shared Iceberg table ensures seamless workflows and efficient operations. Here’s why it matters and how to optimize:
Benefits
- One copy of the data: Spark and Trino read and write the same Iceberg tables on S3, so there is no duplicate storage and no redundant ETL to keep copies in sync.
- Consistent metadata: the shared Hive metastore gives both engines the same view of schemas, snapshots, and table locations.
- Right engine for the job: Spark handles complex ETL and transformations, while Trino serves fast, ad-hoc SQL and federated queries over the same tables.
Best Practices
- Point both engines at the same Hive metastore URI and the same S3 warehouse location so table definitions never drift.
- Keep the Iceberg runtime versions used by Spark and Trino compatible to avoid write or read incompatibilities (see the DataFrameWriterV2 note above).
- Use catalog-qualified table names (iceberg.<SCHEMA-NAME>.<TABLE-NAME>) in both engines so queries resolve to the same tables.
Unlock the full potential of your data lake by implementing these best practices today!