Insights from Netflix's Data Engineering Open Forum 2025

I attended Netflix's Data Engineering Open Forum 2025 at the company's Los Gatos headquarters, and it was packed with deep technical discussions and big ideas shaping the future of large-scale, high-performance, real-time data platforms.

From Spark 4.0 innovations to the future of lakehouse architectures and data interoperability — here are some of my top takeaways.


Spark 4.0, currently in preview, is a major leap forward for scalable, Python-friendly distributed computing.

Spark 4.0 isn't just a version bump; it's a move toward a more modular, high-performance, Python-native Spark ecosystem. The features that excited me most:

  • Spark Connect: A major architectural shift that decouples Spark clients from Spark drivers — solving many multi-tenant operational challenges and enabling flexible deployments and remote execution in a modular, language-agnostic, and scalable way.
  • Python UDF Performance: Traditional Python UDFs hit serialization bottlenecks between Python workers and Spark executors. With Arrow-optimized UDFs (useArrow=True), operations run significantly faster and support better type handling.
  • Python UDTFs (User-Defined Table Functions): Spark 4.0 introduces native support for Python UDTFs, allowing Python functions to return full tables — opening new doors for complex transformations.
  • Python Data Source API: Developers can now build, register, and read/write from custom Python-based data sources. Especially useful for generating synthetic data or ingesting from custom systems, with Arrow integration and partitioning support (file/date-based).
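The serialization point about UDFs is worth making concrete. The stdlib sketch below is a toy model, not Spark code: it uses pickle as a stand-in for the data transfer between executor and Python worker, and both function names are invented. It contrasts one round-trip per row (the classic pickled-UDF path) with a single batched transfer (the essence of the Arrow-optimized path).

```python
# Toy model of why Arrow-optimized UDFs beat row-at-a-time UDFs.
# pickle here stands in for the JVM <-> Python worker transport;
# no Spark is involved.
import pickle

def per_row_transfer(values):
    # Classic pickled-UDF path: one (de)serialization round-trip per row.
    out = []
    for v in values:
        payload = pickle.dumps(v)        # each row crosses the boundary alone
        out.append(pickle.loads(payload) * 2)
    return out

def batched_transfer(values):
    # Arrow-style path: the whole column crosses the boundary once,
    # then the transform runs over the batch in one pass.
    payload = pickle.dumps(values)       # stand-in for one Arrow record batch
    batch = pickle.loads(payload)
    return [v * 2 for v in batch]
```

Both paths compute the same result; the batched version simply amortizes the serialization cost over the whole column, which is where the speedup comes from.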
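To make the UDTF idea concrete, here's a cluster-free sketch. In real PySpark 4.0 a class like this would be registered with the udtf decorator from pyspark.sql.functions; the SplitWords class and the run_udtf harness below are invented for illustration, but they mirror the shape of the API: an eval() method that yields zero or more output rows per input row.

```python
# Conceptual sketch of a Spark 4.0 Python UDTF (User-Defined Table Function).
# Driven with plain Python here so the API shape is visible without a cluster.

class SplitWords:
    """Toy UDTF: one input row in, many output rows (word, length) out."""

    def eval(self, text: str):
        # A UDTF's eval() yields tuples; each tuple becomes one row
        # of the returned table.
        for word in text.split():
            yield (word, len(word))


def run_udtf(udtf_cls, rows):
    """Tiny stand-in for the engine: apply the UDTF to each input row."""
    fn = udtf_cls()
    out = []
    for row in rows:
        out.extend(fn.eval(row))
    return out


# Two input rows expand into a four-row table.
result = run_udtf(SplitWords, ["hello spark", "python udtf"])
```

The one-row-in, table-out shape is what makes UDTFs useful for exploding, parsing, and other transformations that plain UDFs (one value out per row) can't express.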


Apache XTable is an open-source project (in incubation) that enables interoperability across lakehouse table formats — Delta Lake, Apache Iceberg, and Apache Hudi — by standardizing how metadata is accessed across different engines.

What It Does:

  • Standardizes table metadata access and interoperability across engines like Spark, Trino, Flink, and Hive — without introducing yet another table format.
  • Introduces a metadata translation layer that decouples compute engines from format-specific metadata, enabling cross-engine querying of the same data without duplication.


The big shift?

Modern data platforms need to be real-time, high-performance, transactionally reliable, and developer-first to keep up with new product demands.

Excited to dig deeper and apply these ideas in upcoming projects!

#Netflix #Spark4 #DataEngineering #BigData #RealTimeAnalytics #Lakehouse #Databricks #Python
