Apache Spark is an open-source cluster computing framework designed for large-scale data processing, including data generated in real time.
- Spark was built on top of the Hadoop MapReduce model and extends it to more types of computation.
- It is optimized to run in memory, whereas alternatives such as Hadoop MapReduce write intermediate data to and from disk.
- As a result, Spark processes data much faster than these alternatives (see the first sketch after this list).
- Fast - It provides high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.
- Easy to Use - It facilitates writing applications in Java, Scala, Python, R, and SQL. It also provides more than 80 high-level operators.
- Generality - It provides a unified collection of libraries, including SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Spark Streaming (see the second sketch after this list).
- Lightweight - It is a lightweight unified analytics engine for large-scale data processing.
- Runs Everywhere - It can easily run on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud.
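To make the in-memory point concrete, here is a minimal PySpark sketch. The file name (`events.csv`) and the columns (`event_type`, `amount`) are hypothetical placeholders; the pattern it shows is that `cache()` keeps a DataFrame in executor memory so repeated actions avoid re-reading from disk.

```python
# Minimal caching sketch; file and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CachingDemo").getOrCreate()

# Read a hypothetical CSV with an "event_type" and a numeric "amount" column.
df = spark.read.csv("events.csv", header=True, inferSchema=True)

df.cache()  # persist the DataFrame in memory after the first action computes it

# Both actions below reuse the cached in-memory data instead of re-reading the file.
print(df.count())
df.groupBy("event_type").sum("amount").show()

spark.stop()
```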
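And as a rough illustration of the generality point, the sketch below expresses the same computation twice: once with high-level DataFrame operators and once with Spark SQL. The table and column names (`people`, `name`, `age`) are made up for the example.

```python
# Same query via the DataFrame API and via Spark SQL; data is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SQLDemo").getOrCreate()

people = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Cara", 29)],
    ["name", "age"],
)

# DataFrame API: high-level operators such as filter and select.
people.filter(people.age > 30).select("name").show()

# Spark SQL: register the DataFrame as a temporary view and query it with SQL.
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()

spark.stop()
```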
Before starting to learn Apache Spark, it is recommended to understand some basic concepts of Big Data and Hadoop. If you haven't explored these concepts yet, go through the link below for the fundamentals.
Follow Mohammad Azzam for more such content on Spark and Data Engineering concepts.