Matei Zaharia

I recently started an assistant professor position in CSAIL at MIT, where I work on computer systems and big data. Before that, I obtained my PhD at UC Berkeley. I’m also co-founder and CTO of Databricks, the big data company commercializing Apache Spark.

You can contact me at matei@mit.edu or find me in the Stata center, office 32G-996.

Projects

I work on systems and algorithms for large-scale data-intensive computing. My projects include:

Spark: As big data analytics evolves beyond simple batch jobs, there is a need for both more complex multi-stage applications (e.g. machine learning algorithms) and more interactive ad-hoc queries. Spark provides an efficient abstraction for in-memory cluster computing called Resilient Distributed Datasets, and can run 100x faster than Hadoop for these applications. (homepage) (short paper) (NSDI’12 paper)

Shark: This high-speed query engine runs Hive SQL queries on top of Spark up to 100x faster than Hive, and supports fault recovery and complex analytics (e.g. machine learning). (homepage) (SIGMOD’13)

Mesos: Clusters are running increasingly diverse applications, from batch jobs to interactive services. Mesos is a cluster manager that efficiently supports diverse applications by letting them control their own scheduling. The project is open source in the Apache Incubator. (homepage) (NSDI’11 paper)

Multi-Resource Fairness: Life is not fair, but with a little help, your computer system can be, ensuring predictable time-sharing between users. However, past work on fair sharing considered a single resource (e.g. CPU), while cluster applications have demands across multiple resources (memory, IO, CPU, etc.). Dominant resource fairness generalizes max-min fairness for this case. (NSDI’11) (SIGCOMM’12)

MapReduce Scheduling: I’ve worked on several scheduling algorithms for MapReduce, including the LATE algorithm for straggler mitigation (OSDI’08) and delay scheduling for data locality (EuroSys’10). Both algorithms are now included in Hadoop. I also developed the Hadoop Fair Scheduler.

SNAP Sequence Aligner: To tackle the growing volume of genomic data, SNAP is a new sequence alignment algorithm that is 10-100x faster than current tools and also more accurate. (homepage) (arXiv)

To learn more about my graduate research, you can also read my job application materials.

Publications

2015

M. Armbrust, R. Xin, C. Lian, Y. Huai, D. Liu, J. Bradley, X. Meng, T. Kaftan, M. Franklin, A. Ghodsi and M. Zaharia. Spark SQL: Relational Data Processing in Spark. To appear in SIGMOD 2015.

2014

H. Li, A. Ghodsi, M. Zaharia, S. Shenker and I. Stoica, Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks, SOCC 2014, November 2014.
S.N. Naccache, S. Federman, N. Veeraraghavan, M. Zaharia, D. Lee, E. Samayoa, J. Bouquet, A.L. Greninger, K. Luk, B. Enge, D.A. Wadford, S.L. Messenger, G.L. Genrich, K. Pellegrino, G. Grard, E. Leroy, B.S. Schneider, J.N. Fair, M.A. Martinez, P. Isa, J.A. Crump, J.L. DeRisi, T. Sittler, J. Hackett Jr., S. Miller and C.Y. Chiu, A Cloud-Compatible Bioinformatics Pipeline for Ultrarapid Pathogen Identification from Next-Generation Sequencing of Clinical Samples, Genome Research, June 2014.

2013

M. Zaharia. An Architecture for Fast and General Data Processing on Large Clusters (PhD Dissertation).
M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica. Discretized Streams: Fault-Tolerant Streaming Computation at Scale, SOSP 2013, November 2013.
K. Ousterhout, P. Wendell, M. Zaharia and I. Stoica. Sparrow: Distributed, Low-Latency Scheduling, SOSP 2013, November 2013.
R. Xin, J. Rosen, M. Zaharia, M. Franklin, S. Shenker, and I. Stoica. Shark: SQL and Rich Analytics at Scale, SIGMOD 2013, June 2013.
A. Ghodsi, M. Zaharia, S. Shenker and I. Stoica. Choosy: Max-Min Fair Sharing for Datacenter Jobs with Constraints, EuroSys 2013, April 2013.

2012

A. Ghodsi, V. Sekar, M. Zaharia and I. Stoica. Multi-Resource Fair Queueing for Packet Processing, SIGCOMM 2012, August 2012. Best Paper Award.
M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M.J. Franklin, S. Shenker, I. Stoica. Fast and Interactive Analytics over Hadoop Data with Spark, USENIX ;login:, August 2012.
M. Zaharia, T. Das, H. Li, S. Shenker and I. Stoica. Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters, HotCloud 2012, June 2012.
L. Martignoni, P. Poosankam, M. Zaharia, J. Han, S. McCamant, D. Song, V. Paxson, A. Perrig, S. Shenker, I. Stoica. Cloud Terminal: Secure Access to Sensitive Applications from Untrusted Systems, USENIX ATC 2012, June 2012.
C. Engle, A. Lupher, R. Xin, M. Zaharia, M. Franklin, S. Shenker, I. Stoica. Shark: Fast Data Analysis Using Coarse-grained Distributed Memory (demo), SIGMOD 2012, May 2012. Best Demo Award.
M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M.J. Franklin, S. Shenker, I. Stoica. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, NSDI 2012, April 2012. Best Paper Award and Honorable Mention for Community Award.

2011

T. Hunter, T. Moldovan, M. Zaharia, S. Merzgui, J. Ma, M.J. Franklin, P. Abbeel, and A.M. Bayen. Scaling the Mobile Millennium System in the Cloud, SOCC 2011, October 2011.
M. Chowdhury, M. Zaharia, J. Ma, M.I. Jordan and I. Stoica, Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM 2011, August 2011.
B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A.D. Joseph, R. Katz, S. Shenker and I. Stoica, Mesos: Flexible Resource Sharing for the Cloud, USENIX ;login:, August 2011.
M. Zaharia, B. Hindman, A. Konwinski, A. Ghodsi, A.D. Joseph, R. Katz, S. Shenker and I. Stoica, The Datacenter Needs an Operating System, HotCloud 2011, June 2011.
B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A.D. Joseph, R. Katz, S. Shenker and I. Stoica, Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center, NSDI 2011, March 2011.
A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica, Dominant Resource Fairness: Fair Allocation of Multiple Resource Types, NSDI 2011, March 2011.

2010

M. Zaharia, M. Chowdhury, M.J. Franklin, S. Shenker and I. Stoica. Spark: Cluster Computing with Working Sets, HotCloud 2010, June 2010.
M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker and I. Stoica. Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling, EuroSys 2010, April 2010.
M. Armbrust, A. Fox, R. Griffith, A.D. Joseph, R.H. Katz, A. Konwinski, G. Lee, D.A. Patterson, A. Rabkin, I. Stoica and M. Zaharia, Above the Clouds: A View of Cloud Computing, Communications of the ACM, April 2010.
S. Guo, M. Derakhshani, M.H. Falaki, U. Ismail, R. Luk, E.A. Oliver, S. Ur Rahman, A. Seth, M.A. Zaharia, S. Keshav, Design and Implementation of the KioskNet System, Computer Networks, ISSN 1389-1286, DOI: 10.1016/j.comnet.2010.08.001

Earlier

B. Hindman, A. Konwinski, M. Zaharia and I. Stoica, A Common Substrate for Cluster Computing, HotCloud 2009, June 2009.
R. Luk, M. Zaharia, M. Ho, B. Levine and P. Aoki, ICTD for Healthcare in Ghana: Two Parallel Case Studies, ICTD 2009, April 2009.
M. Zaharia, A. Konwinski, A.D. Joseph, R. Katz and I. Stoica, Improving MapReduce Performance in Heterogeneous Environments, OSDI 2008, December 2008.
S. Guo, M.H. Falaki, E.A. Oliver, S. Ur Rahman, A. Seth, M. Zaharia, U. Ismail, and S. Keshav, Design and Implementation of the KioskNet System, ICTD 2007, December 2007.
S. Guo, M.H. Falaki, E.A. Oliver, S. Ur Rahman, A. Seth, M. Zaharia, and S. Keshav, Very Low-Cost Internet Access Using KioskNet, ACM Computer Communication Review, October 2007.
M. Zaharia and S. Keshav, Gossip-based Search Selection in Hybrid Peer-to-Peer Networks, J. Concurrency and Computation: Practice and Experience, 2007.
M. Zaharia, A. Chandel, S. Saroiu, and S. Keshav, Finding Content in File-Sharing Networks When You Can’t Even Spell, Proc. IPTPS, February 2007.
A. Seth, D. Kroeker, M. Zaharia, S. Guo, S. Keshav, Low-cost Communication for Rural Internet Kiosks Using Mechanical Backhaul, Proc. MOBICOM 2006, September 2006.
M. Zaharia and S. Keshav, Gossip-Based Search Selection in Hybrid Peer-to-Peer Networks, Proc. IPTPS, February 2006.

Full Publication List and Technical Reports

Talks

Spark and Shark: High-Speed In-Memory Analytics over Hadoop and Hive Data (pptx, pdf), Hadoop Summit 2012, San Jose, CA, June 2012.
Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters (pptx, pdf), HotCloud 2012, Boston, MA, June 2012.
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing (pptx, pdf), NSDI 2012, San Jose, CA, April 2012.
Spark: In-Memory Cluster Computing for Iterative and Interactive Applications (machine learning focused version) (pptx, pdf), NIPS Big Learning Workshop, Sierra Nevada, Spain, December 2011.
Spark: In-Memory Cluster Computing for Iterative and Interactive Applications (pptx, pdf), Google Inc, Mountain View, CA, October 2011.
The Datacenter Needs an Operating System (ppt, pdf), HotCloud 2011, Portland, OR, June 2011.
Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center (ppt, pdf), NSDI 2011, Boston, MA, March 2011.
Spark: In-Memory Cluster Computing for Iterative and Interactive Applications (ppt, pdf), Stanford University, Stanford, CA, February 2011.
Spark: Cluster Computing with Working Sets (ppt, pdf), HotCloud 2010, Boston, MA, June 2010.
Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling (ppt, pdf), EuroSys 2010, Paris, France, April 2010.
Job Scheduling with the Fair and Capacity Schedulers (ppt, pdf), Hadoop Summit 2009, Santa Clara, CA, June 2009.
Job Scheduling for MapReduce (ppt, pdf), Microsoft Research Silicon Valley, Mountain View, CA, January 2009.
Improving MapReduce Performance in Heterogeneous Environments (ppt, pdf), OSDI 2008, San Diego, CA, December 2008.

Open Source

Almost all of my work is open source:

The Spark cluster computing framework is now an Apache project at spark.apache.org. We have also open sourced Shark, our Apache Hive compatible SQL and analytics engine built on Spark.
The Mesos cluster manager is a top-level Apache project.
The LATE algorithm for straggler mitigation and the Hadoop Fair Scheduler are included in Apache Hadoop.
The SNAP sequence aligner is available on GitHub.

I’m also a committer on the Apache Hadoop, Spark and Mesos projects.
