Matei Zaharia
I recently started an assistant professor position in CSAIL at MIT, where I work on computer systems and big data. Before that, I obtained my PhD at UC Berkeley. I’m also co-founder and CTO of Databricks, the big data company commercializing Apache Spark.
You can contact me at matei@mit.edu or find me in the Stata Center, office 32G-996.
Projects
I work on systems and algorithms for large-scale data-intensive computing. My projects include:
Spark: As big data analytics evolves beyond simple batch jobs, there is a need for both more complex multi-stage applications (e.g. machine learning algorithms) and more interactive ad-hoc queries. Spark provides an efficient abstraction for in-memory cluster computing called Resilient Distributed Datasets, and can run up to 100x faster than Hadoop for these applications. (homepage) (short paper) (NSDI’12 paper)
Shark: This high-speed query engine runs Hive SQL queries on top of Spark up to 100x faster than Hive, and supports fault recovery and complex analytics (e.g. machine learning). (homepage) (SIGMOD’13)
Mesos: Clusters are running increasingly diverse applications, from batch jobs to interactive services. Mesos is a cluster manager that efficiently supports diverse applications by letting them control their own scheduling. The project is open source as a top-level Apache project. (homepage) (NSDI’11 paper)
Multi-Resource Fairness: Life is not fair, but with a little help, your computer system can be, ensuring predictable time-sharing between users. However, past work on fair sharing considered a single resource (e.g. CPU), while cluster applications have demands across multiple resources (memory, I/O, CPU, etc.). Dominant resource fairness generalizes max-min fairness to this case. (NSDI’11) (SIGCOMM’12)
MapReduce Scheduling: I’ve worked on several scheduling algorithms for MapReduce, including the LATE algorithm for straggler mitigation (OSDI’08) and delay scheduling for data locality (EuroSys’10). Both algorithms are now included in Hadoop. I also developed the Hadoop Fair Scheduler.
SNAP Sequence Aligner: SNAP is a new sequence alignment algorithm designed to tackle the growing volume of genomic data. It is 10-100x faster than current tools and also more accurate. (homepage) (arXiv)
To learn more about my graduate research, you can also read my job application materials.
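For readers new to dominant resource fairness, the idea in the Multi-Resource Fairness item above can be illustrated with a minimal sketch. It is not the implementation from the papers or from any cluster manager: the function name `drf_allocate`, the resource names, and the capacities and demands are all illustrative assumptions, and the sketch simplifies by dropping a user as soon as their next task no longer fits. A user's dominant share is the largest fraction of any single resource that they hold, and each step gives one more task to the user whose dominant share is currently smallest.

```python
from fractions import Fraction


def drf_allocate(capacity, demands):
    """Sketch of dominant resource fairness by progressive filling.

    capacity: total amount of each resource, e.g. {"cpu": 9, "mem": 18}
    demands:  per-task demand of each user, e.g. {"A": {"cpu": 1, "mem": 4}}
    Returns the number of tasks launched per user.
    """
    used = {r: 0 for r in capacity}
    tasks = {u: 0 for u in demands}
    dominant = {u: Fraction(0) for u in demands}
    active = set(demands)

    while active:
        # Pick the user with the smallest dominant share (ties broken by name).
        user = min(sorted(active), key=lambda u: dominant[u])
        demand = demands[user]

        # Simplification: if the next task does not fit, stop serving this user.
        if any(used[r] + demand[r] > capacity[r] for r in capacity):
            active.discard(user)
            continue

        for r in capacity:
            used[r] += demand[r]
        tasks[user] += 1

        # Dominant share: the largest fraction of any one resource the user holds.
        dominant[user] = max(
            Fraction(tasks[user] * demand[r], capacity[r]) for r in capacity
        )

    return tasks


if __name__ == "__main__":
    capacity = {"cpu": 9, "mem": 18}
    demands = {
        "A": {"cpu": 1, "mem": 4},  # memory-heavy tasks
        "B": {"cpu": 3, "mem": 1},  # CPU-heavy tasks
    }
    print(drf_allocate(capacity, demands))  # {'A': 3, 'B': 2}
```

With these illustrative numbers, user A ends up with three tasks (12 of 18 memory units) and user B with two (6 of 9 CPUs), so both users hold a dominant share of 2/3 even though they are bottlenecked on different resources.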
Publications
2015
- M. Armbrust, R. Xin, C. Lian, Y. Huai, D. Liu, J. Bradley, X. Meng, T. Kaftan, M. Franklin, A. Ghodsi and M. Zaharia. Spark SQL: Relational Data Processing in Spark. To appear in SIGMOD 2015.
2014
- H. Li, A. Ghodsi, M. Zaharia, S. Shenker and I. Stoica, Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks, SOCC 2014, November 2014.
- S.N. Naccache, S. Federman, N. Veeraraghavan, M. Zaharia, D. Lee, E. Samayoa, J. Bouquet, A.L. Greninger, K. Luk, B. Enge, D.A. Wadford, S.L. Messenger, G.L. Genrich, K. Pellegrino, G. Grard, E. Leroy, B.S. Schneider, J.N. Fair, M.A. Martinez, P. Isa, J.A. Crump, J.L. DeRisi, T. Sittler, J. Hackett Jr., S. Miller and C.Y. Chiu, A Cloud-Compatible Bioinformatics Pipeline for Ultrarapid Pathogen Identification from Next-Generation Sequencing of Clinical Samples, Genome Research, June 2014.
2013
- M. Zaharia. An Architecture for Fast and General Data Processing on Large Clusters (PhD Dissertation).
- M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica. Discretized Streams: Fault-Tolerant Streaming Computation at Scale, SOSP 2013, November 2013.
- K. Ousterhout, P. Wendell, M. Zaharia and I. Stoica. Sparrow: Distributed, Low-Latency Scheduling, SOSP 2013, November 2013.
- R. Xin, J. Rosen, M. Zaharia, M. Franklin, S. Shenker, and I. Stoica. Shark: SQL and Rich Analytics at Scale, SIGMOD 2013, June 2013.
- A. Ghodsi, M. Zaharia, S. Shenker and I. Stoica. Choosy: Max-Min Fair Sharing for Datacenter Jobs with Constraints, EuroSys 2013, April 2013.
2012
- A. Ghodsi, V. Sekar, M. Zaharia and I. Stoica. Multi-Resource Fair Queueing for Packet Processing, SIGCOMM 2012, August 2012. Best Paper Award.
- M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M.J. Franklin, S. Shenker, I. Stoica. Fast and Interactive Analytics over Hadoop Data with Spark, USENIX ;login:, August 2012.
- M. Zaharia, T. Das, H. Li, S. Shenker and I. Stoica. Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters, HotCloud 2012, June 2012.
- L. Martignoni, P. Poosankam, M. Zaharia, J. Han, S. McCamant, D. Song, V. Paxson, A. Perrig, S. Shenker, I. Stoica. Cloud Terminal: Secure Access to Sensitive Applications from Untrusted Systems, USENIX ATC 2012, June 2012.
- C. Engle, A. Lupher, R. Xin, M. Zaharia, M. Franklin, S. Shenker, I. Stoica. Shark: Fast Data Analysis Using Coarse-grained Distributed Memory (demo), SIGMOD 2012, May 2012. Best Demo Award.
- M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M.J. Franklin, S. Shenker, I. Stoica. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, NSDI 2012, April 2012. Best Paper Award and Honorable Mention for Community Award.
2011
- T. Hunter, T. Moldovan, M. Zaharia, S. Merzgui, J. Ma, M.J. Franklin, P. Abbeel, and A.M. Bayen. Scaling the Mobile Millennium System in the Cloud, SOCC 2011, October 2011.
- M. Chowdhury, M. Zaharia, J. Ma, M.I. Jordan and I. Stoica, Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM 2011, August 2011.
- B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A.D. Joseph, R. Katz, S. Shenker and I. Stoica, Mesos: Flexible Resource Sharing for the Cloud, USENIX ;login:, August 2011.
- M. Zaharia, B. Hindman, A. Konwinski, A. Ghodsi, A.D. Joseph, R. Katz, S. Shenker and I. Stoica, The Datacenter Needs an Operating System, HotCloud 2011, June 2011.
- B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A.D. Joseph, R. Katz, S. Shenker and I. Stoica, Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center, NSDI 2011, March 2011.
- A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica, Dominant Resource Fairness: Fair Allocation of Multiple Resource Types, NSDI 2011, March 2011.
2010
- M. Zaharia, M. Chowdhury, M.J. Franklin, S. Shenker and I. Stoica. Spark: Cluster Computing with Working Sets, HotCloud 2010, June 2010.
- M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker and I. Stoica. Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling, EuroSys 2010, April 2010.
- M. Armbrust, A. Fox, R. Griffith, A.D. Joseph, R.H. Katz, A. Konwinski, G. Lee, D.A. Patterson, A. Rabkin, I. Stoica and M. Zaharia, Above the Clouds: A View of Cloud Computing, Communications of the ACM, April 2010.
- S. Guo, M. Derakhshani, M.H. Falaki, U. Ismail, R. Luk, E.A. Oliver, S. Ur Rahman, A. Seth, M.A. Zaharia, S. Keshav, Design and Implementation of the KioskNet System, Computer Networks, ISSN 1389-1286, DOI: 10.1016/j.comnet.2010.08.001
Earlier
- B. Hindman, A. Konwinski, M. Zaharia and I. Stoica, A Common Substrate for Cluster Computing, HotCloud 2009, June 2009.
- R. Luk, M. Zaharia, M. Ho, B. Levine and P. Aoki, ICTD for Healthcare in Ghana: Two Parallel Case Studies, ICTD 2009, April 2009.
- M. Zaharia, A. Konwinski, A.D. Joseph, R. Katz and I. Stoica, Improving MapReduce Performance in Heterogeneous Environments, OSDI 2008, December 2008.
- S. Guo, M.H. Falaki, E.A. Oliver, S. Ur Rahman, A. Seth, M. Zaharia, U. Ismail, and S. Keshav, Design and Implementation of the KioskNet System, ICTD 2007, December 2007.
- S. Guo, M.H. Falaki, E.A. Oliver, S. Ur Rahman, A. Seth, M. Zaharia, and S. Keshav, Very Low-Cost Internet Access Using KioskNet, ACM Computer Communication Review, October 2007.
- M. Zaharia and S. Keshav, Gossip-based Search Selection in Hybrid Peer-to-Peer Networks, J. Concurrency and Computation: Practice and Experience, 2007.
- M. Zaharia, A. Chandel, S. Saroiu, and S. Keshav, Finding Content in File-Sharing Networks When You Can’t Even Spell, Proc. IPTPS, February 2007.
- A. Seth, D. Kroeker, M. Zaharia, S. Guo, S. Keshav, Low-cost Communication for Rural Internet Kiosks Using Mechanical Backhaul, Proc. MOBICOM 2006, September 2006.
- M. Zaharia and S. Keshav, Gossip-Based Search Selection in Hybrid Peer-to-Peer Networks, Proc. IPTPS, February 2006.
Full Publication List and Technical Reports
Talks
- Spark and Shark: High-Speed In-Memory Analytics over Hadoop and Hive Data (pptx, pdf), Hadoop Summit 2012, San Jose, CA, June 2012.
- Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters (pptx, pdf), HotCloud 2012, Boston, MA, June 2012.
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing (pptx, pdf), NSDI 2012, San Jose, CA, April 2012.
- Spark: In-Memory Cluster Computing for Iterative and Interactive Applications (machine learning focused version) (pptx, pdf), NIPS Big Learning Workshop, Sierra Nevada, Spain, December 2011.
- Spark: In-Memory Cluster Computing for Iterative and Interactive Applications (pptx, pdf), Google Inc, Mountain View, CA, October 2011.
- The Datacenter Needs an Operating System (ppt, pdf), HotCloud 2011, Portland, OR, June 2011.
- Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center (ppt, pdf), NSDI 2011, Boston, MA, March 2011.
- Spark: In-Memory Cluster Computing for Iterative and Interactive Applications (ppt, pdf), Stanford University, Stanford, CA, February 2011.
- Spark: Cluster Computing with Working Sets (ppt, pdf), HotCloud 2010, Boston, MA, June 2010.
- Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling (ppt, pdf), EuroSys 2010, Paris, France, April 2010.
- Job Scheduling with the Fair and Capacity Schedulers (ppt, pdf), Hadoop Summit 2009, Santa Clara, CA, June 2009.
- Job Scheduling for MapReduce (ppt, pdf), Microsoft Research Silicon Valley, Mountain View, CA, January 2009.
- Improving MapReduce Performance in Heterogeneous Environments (ppt, pdf), OSDI 2008, San Diego, CA, December 2008.
Open Source
Almost all of my work is open source:
- The Spark cluster computing framework is now an Apache project at spark.apache.org. We have also open sourced Shark, our Apache Hive-compatible SQL and analytics engine built on Spark.
- The Mesos cluster manager is a top-level Apache project.
- The LATE algorithm for straggler mitigation and the Hadoop Fair Scheduler are included in Apache Hadoop.
- The SNAP sequence aligner is available on GitHub.
I’m also a committer on the Apache Hadoop, Spark and Mesos projects.