Matei Zaharia

I recently started an assistant professor position in CSAIL at MIT, where I work on computer systems and big data. Before that, I obtained my PhD at UC Berkeley. I’m also co-founder and CTO of Databricks, the big data company commercializing Apache Spark.

You can contact me at matei@mit.edu or find me in the Stata Center, office 32G-996.

Projects

I work on systems and algorithms for large-scale data-intensive computing. My projects include:

Spark: As big data analytics evolves beyond simple batch jobs, there is a need for both more complex multi-stage applications (e.g. machine learning algorithms) and more interactive ad-hoc queries. Spark provides an efficient abstraction for in-memory cluster computing called Resilient Distributed Datasets, and can run up to 100x faster than Hadoop for these applications. (homepage) (short paper) (NSDI’12 paper)

Shark: This high-speed query engine runs Hive SQL queries on top of Spark up to 100x faster than Hive, and supports fault recovery and complex analytics (e.g. machine learning). (homepage) (SIGMOD’13)

Mesos: Clusters are running increasingly diverse applications, from batch jobs to interactive services. Mesos is a cluster manager that efficiently supports diverse applications by letting them control their own scheduling. The project is open source in the Apache Incubator. (homepage) (NSDI’11 paper)

Multi-Resource Fairness: Life is not fair, but with a little help, your computer system can be, ensuring predictable time-sharing between users. However, past work on fair sharing considered a single resource (e.g. CPU), while cluster applications have demands across multiple resources (memory, IO, CPU, etc.). Dominant resource fairness generalizes max-min fairness to this case. (NSDI’11) (SIGCOMM’12)
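As a rough illustration of the idea, here is a minimal Python sketch of dominant resource fairness using progressive filling: repeatedly offer the next task to the user whose dominant share (their largest fractional use of any one resource) is currently smallest. The function name `drf_allocate`, the input layout, and the choice to skip a user once their next task no longer fits are assumptions made for this sketch, not details taken from the papers or from any production scheduler.

```python
def drf_allocate(capacity, demands):
    """Progressive-filling sketch of dominant resource fairness.

    capacity: total amount of each resource, e.g. [9, 18] for 9 CPUs, 18 GB.
    demands:  per-task demand vector for each user, e.g. {"A": [1, 4]}.
    Returns the number of tasks launched for each user.
    """
    used = [0] * len(capacity)
    tasks = {user: 0 for user in demands}
    active = set(demands)  # users whose next task might still fit

    def dominant_share(user):
        # Largest fraction of any single resource held by this user.
        return max(tasks[user] * d / c
                   for d, c in zip(demands[user], capacity))

    while active:
        # Serve the user with the smallest dominant share next
        # (ties broken by name so the result is deterministic).
        user = min(sorted(active), key=dominant_share)
        demand = demands[user]
        if all(u + d <= c for u, d, c in zip(used, demand, capacity)):
            used = [u + d for u, d in zip(used, demand)]
            tasks[user] += 1
        else:
            # Assumption of this sketch: drop a user whose task no longer fits.
            active.remove(user)
    return tasks


if __name__ == "__main__":
    # Illustrative cluster: 9 CPUs and 18 GB of memory, two users.
    print(drf_allocate([9, 18], {"A": [1, 4], "B": [3, 1]}))
```

In this example user A is memory-bound and user B is CPU-bound, and both end up holding two thirds of their dominant resource. With a single resource the same loop reduces to ordinary max-min fair sharing.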

MapReduce Scheduling: I’ve worked on several scheduling algorithms for MapReduce, including the LATE algorithm for straggler mitigation (OSDI’08) and delay scheduling for data locality (EuroSys’10). Both algorithms are now included in Hadoop. I also developed the Hadoop Fair Scheduler.

SNAP Sequence Aligner: To tackle the growing volume of genomic data, SNAP is a new sequence alignment algorithm that is 10-100x faster than current tools and also more accurate. (homepage) (arXiv)

To learn more about my graduate research, you can also read my job application materials.

Publications

2015

2014

2013

2012

2011

2010

Earlier

Full Publication List and Technical Reports

Talks

Open Source

Almost all of my work is open source:

I’m also a committer on the Apache Hadoop, Spark and Mesos projects.

