Insights from Netflix's Data Engineering Open Forum 2025

I attended Netflix's Data Engineering Open Forum 2025 at the company's Los Gatos headquarters, and it was packed with deep technical discussions and big ideas shaping the future of large-scale, high-performance, real-time data platforms.

From Spark 4.0 innovations to the future of lakehouse architectures and data interoperability — here are some of my top takeaways.


Spark 4.0, currently in preview, is a major leap forward for scalable, Python-friendly distributed computing.

Spark 4.0 isn't just a version bump; it's a move toward a more modular, high-performance, Python-native Spark ecosystem. The features that excited me most:

  • Spark Connect: A major architectural shift that decouples Spark clients from Spark drivers — solving many multi-tenant operational challenges and enabling flexible deployments and remote execution in a modular, language-agnostic, and scalable way.
  • Python UDF Performance: Traditional Python UDFs hit serialization bottlenecks between Python workers and Spark executors. With Arrow-optimized UDFs (useArrow=True), operations run significantly faster and support better type handling.
  • Python UDTFs (User-Defined Table Functions): Spark 4.0 introduces native support for Python UDTFs, allowing Python functions to return full tables — opening new doors for complex transformations.
  • Python Data Source API: Developers can now build, register, and read/write from custom Python-based data sources. Especially useful for generating synthetic data or ingesting from custom systems, with Arrow integration and partitioning support (file/date-based).
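The serialization point about UDFs is worth making concrete. The stdlib sketch below is a toy model, not Spark code: it uses pickle as a stand-in for the data transfer between executor and Python worker, and both function names are invented. It contrasts one round-trip per row (the classic pickled-UDF path) with a single batched transfer (the essence of the Arrow-optimized path).

```python
# Toy model of why Arrow-optimized UDFs beat row-at-a-time UDFs.
# pickle here stands in for the JVM <-> Python worker transport;
# no Spark is involved.
import pickle

def per_row_transfer(values):
    # Classic pickled-UDF path: one (de)serialization round-trip per row.
    out = []
    for v in values:
        payload = pickle.dumps(v)        # each row crosses the boundary alone
        out.append(pickle.loads(payload) * 2)
    return out

def batched_transfer(values):
    # Arrow-style path: the whole column crosses the boundary once,
    # then the transform runs over the batch in one pass.
    payload = pickle.dumps(values)       # stand-in for one Arrow record batch
    batch = pickle.loads(payload)
    return [v * 2 for v in batch]
```

Both paths compute the same result; the batched version simply amortizes the serialization cost over the whole column, which is where the speedup comes from.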
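To make the UDTF idea concrete, here's a cluster-free sketch. In real PySpark 4.0 a class like this would be registered with the udtf decorator from pyspark.sql.functions; the SplitWords class and the run_udtf harness below are invented for illustration, but they mirror the shape of the API: an eval() method that yields zero or more output rows per input row.

```python
# Conceptual sketch of a Spark 4.0 Python UDTF (User-Defined Table Function).
# Driven with plain Python here so the API shape is visible without a cluster.

class SplitWords:
    """Toy UDTF: one input row in, many output rows (word, length) out."""

    def eval(self, text: str):
        # A UDTF's eval() yields tuples; each tuple becomes one row
        # of the returned table.
        for word in text.split():
            yield (word, len(word))


def run_udtf(udtf_cls, rows):
    """Tiny stand-in for the engine: apply the UDTF to each input row."""
    fn = udtf_cls()
    out = []
    for row in rows:
        out.extend(fn.eval(row))
    return out


# Two input rows expand into a four-row table.
result = run_udtf(SplitWords, ["hello spark", "python udtf"])
```

The one-row-in, table-out shape is what makes UDTFs useful for exploding, parsing, and other transformations that plain UDFs (one value out per row) can't express.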


Apache XTable is an open-source project (in incubation) that enables interoperability across lakehouse table formats — Delta Lake, Apache Iceberg, and Apache Hudi — by standardizing how metadata is accessed across different engines.

What It Does:

  • Standardizes table metadata access and interoperability across engines like Spark, Trino, Flink, and Hive — without introducing yet another table format.
  • Introduces a metadata translation layer that decouples compute engines from format-specific metadata, enabling cross-engine querying of the same data without duplication.


The big shift?

Modern data platforms need to be real-time, high-performance, transactionally reliable, and developer-first to keep up with new product demands.

Excited to dig deeper and apply these ideas in upcoming projects!

#Netflix #Spark4 #DataEngineering #BigData #RealTimeAnalytics #Lakehouse #Databricks #Python
