SlideShare a Scribd company logo
Introduction to Spark
That’s what I do
I drink and I join data
- Tyrion (Imp. Data Engineer @ Pet a Dragon Inc.)
What and Why of Spark?
What and Why of Spark?
Need: Resilient computation of data
For, the computation of data to be resilient.
Hadoop’s mapreduce paradigm stores each map output to local disk.
And, Each map-reduce task output to HDFS.
Spark solves this problem
By introducing RDDs (Resilient Distributed Datasets)
http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
Easy introduction to RDDs
RDDs are like list. RDD[String], RDD[Int]
But,
● In-Memory
● Lazy Evaluations (Spark is like an engineering student.)
● Immutable and Read-only
● Cacheable or Persistence
● Partitioned
● Parallel
● Fault Tolerance
● Location Stickiness
● Typed
Operations are applied per partition
RDDs are immutable
RDDs are lazy engineering students
How does RDDs achieve that?
That is a story for another time.
RDD hands on
DataFrame
Introduction to Apache Spark
Dataframes hands-on
File formats
File formats - Optimised Row Columnar(ORC)
Features:
● Row splits(stripes), each split uses column-oriented storage
● Light-weight, always on compression
● Query performance , returns only required fields
● Specifically designed for hive
● Limited schema evolution
● Self describing
File formats - Parquet
Features:
● Columnar storage
● Supports limited schema evolution
● Self documenting: meta-data is stored with data
● Supports data partitioning
● Query performance, returns only required fields
File formats - Avro
Features:
● Row based data serialization system
● Schema Evolution
● Self describing data - Schema in JSON and Data in binary
● Splittable
● Sync markers that can be used to split large datasets into subsets
Introduction to Apache Spark
Ad

More Related Content

Similar to Introduction to Apache Spark (20)

Spark learning
Spark learningSpark learning
Spark learning
Ajay Guyyala
 
Apache Spark Fundamentals Meetup Talk
Apache Spark Fundamentals Meetup TalkApache Spark Fundamentals Meetup Talk
Apache Spark Fundamentals Meetup Talk
Eren Avşaroğulları
 
Spark_RDD_SyedAcademy
Spark_RDD_SyedAcademySpark_RDD_SyedAcademy
Spark_RDD_SyedAcademy
Syed Hadoop
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
Databricks
 
Spark
SparkSpark
Spark
Mário Almeida
 
Spark
SparkSpark
Spark
Knoldus Inc.
 
Algorithm Analytics Anomaly Detection Artificial Intelligence (AI) Big Data
Algorithm Analytics Anomaly Detection Artificial Intelligence (AI) Big DataAlgorithm Analytics Anomaly Detection Artificial Intelligence (AI) Big Data
Algorithm Analytics Anomaly Detection Artificial Intelligence (AI) Big Data
Gabriel Kamau
 
Apache Spark Introduction.pdf
Apache Spark Introduction.pdfApache Spark Introduction.pdf
Apache Spark Introduction.pdf
MaheshPandit16
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
DataArt
 
Learning Apache Spark by examples
Learning Apache Spark by examplesLearning Apache Spark by examples
Learning Apache Spark by examples
Samuel Yee
 
Stefano Baghino - From Big Data to Fast Data: Apache Spark
Stefano Baghino - From Big Data to Fast Data: Apache SparkStefano Baghino - From Big Data to Fast Data: Apache Spark
Stefano Baghino - From Big Data to Fast Data: Apache Spark
Codemotion
 
Map reduce vs spark
Map reduce vs sparkMap reduce vs spark
Map reduce vs spark
Tudor Lapusan
 
Learn about SPARK tool and it's componemts
Learn about SPARK tool and it's componemtsLearn about SPARK tool and it's componemts
Learn about SPARK tool and it's componemts
siddharth30121
 
Spark vstez
Spark vstezSpark vstez
Spark vstez
David Groozman
 
Big Data Processing using Apache Spark and Clojure
Big Data Processing using Apache Spark and ClojureBig Data Processing using Apache Spark and Clojure
Big Data Processing using Apache Spark and Clojure
Dr. Christian Betz
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
Databricks
 
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015 Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Sameer Farooqui
 
Introduction to Spark: Or how I learned to love 'big data' after all.
Introduction to Spark: Or how I learned to love 'big data' after all.Introduction to Spark: Or how I learned to love 'big data' after all.
Introduction to Spark: Or how I learned to love 'big data' after all.
Peadar Coyle
 
New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015
Robbie Strickland
 
Spark 101
Spark 101Spark 101
Spark 101
Shahaf Azriely {TopLinked} ☁
 
Apache Spark Fundamentals Meetup Talk
Apache Spark Fundamentals Meetup TalkApache Spark Fundamentals Meetup Talk
Apache Spark Fundamentals Meetup Talk
Eren Avşaroğulları
 
Spark_RDD_SyedAcademy
Spark_RDD_SyedAcademySpark_RDD_SyedAcademy
Spark_RDD_SyedAcademy
Syed Hadoop
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
Databricks
 
Algorithm Analytics Anomaly Detection Artificial Intelligence (AI) Big Data
Algorithm Analytics Anomaly Detection Artificial Intelligence (AI) Big DataAlgorithm Analytics Anomaly Detection Artificial Intelligence (AI) Big Data
Algorithm Analytics Anomaly Detection Artificial Intelligence (AI) Big Data
Gabriel Kamau
 
Apache Spark Introduction.pdf
Apache Spark Introduction.pdfApache Spark Introduction.pdf
Apache Spark Introduction.pdf
MaheshPandit16
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
DataArt
 
Learning Apache Spark by examples
Learning Apache Spark by examplesLearning Apache Spark by examples
Learning Apache Spark by examples
Samuel Yee
 
Stefano Baghino - From Big Data to Fast Data: Apache Spark
Stefano Baghino - From Big Data to Fast Data: Apache SparkStefano Baghino - From Big Data to Fast Data: Apache Spark
Stefano Baghino - From Big Data to Fast Data: Apache Spark
Codemotion
 
Learn about SPARK tool and it's componemts
Learn about SPARK tool and it's componemtsLearn about SPARK tool and it's componemts
Learn about SPARK tool and it's componemts
siddharth30121
 
Big Data Processing using Apache Spark and Clojure
Big Data Processing using Apache Spark and ClojureBig Data Processing using Apache Spark and Clojure
Big Data Processing using Apache Spark and Clojure
Dr. Christian Betz
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
Databricks
 
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015 Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Sameer Farooqui
 
Introduction to Spark: Or how I learned to love 'big data' after all.
Introduction to Spark: Or how I learned to love 'big data' after all.Introduction to Spark: Or how I learned to love 'big data' after all.
Introduction to Spark: Or how I learned to love 'big data' after all.
Peadar Coyle
 
New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015
Robbie Strickland
 

Recently uploaded (20)

L1_Slides_Foundational Concepts_508.pptx
L1_Slides_Foundational Concepts_508.pptxL1_Slides_Foundational Concepts_508.pptx
L1_Slides_Foundational Concepts_508.pptx
38NoopurPatel
 
Ann Naser Nabil- Data Scientist Portfolio.pdf
Ann Naser Nabil- Data Scientist Portfolio.pdfAnn Naser Nabil- Data Scientist Portfolio.pdf
Ann Naser Nabil- Data Scientist Portfolio.pdf
আন্ নাসের নাবিল
 
Process Mining as Enabler for Digital Transformations
Process Mining as Enabler for Digital TransformationsProcess Mining as Enabler for Digital Transformations
Process Mining as Enabler for Digital Transformations
Process mining Evangelist
 
AWS-Certified-ML-Engineer-Associate-Slides.pdf
AWS-Certified-ML-Engineer-Associate-Slides.pdfAWS-Certified-ML-Engineer-Associate-Slides.pdf
AWS-Certified-ML-Engineer-Associate-Slides.pdf
philsparkshome
 
50_questions_full.pptxdddddddddddddddddd
50_questions_full.pptxdddddddddddddddddd50_questions_full.pptxdddddddddddddddddd
50_questions_full.pptxdddddddddddddddddd
emir73065
 
Z14_IBM__APL_by_Christian_Demmer_IBM.pdf
Z14_IBM__APL_by_Christian_Demmer_IBM.pdfZ14_IBM__APL_by_Christian_Demmer_IBM.pdf
Z14_IBM__APL_by_Christian_Demmer_IBM.pdf
Fariborz Seyedloo
 
Fundamentals of Data Analysis, its types, tools, algorithms
Fundamentals of Data Analysis, its types, tools, algorithmsFundamentals of Data Analysis, its types, tools, algorithms
Fundamentals of Data Analysis, its types, tools, algorithms
priyaiyerkbcsc
 
AWS RDS Presentation to make concepts easy.pptx
AWS RDS Presentation to make concepts easy.pptxAWS RDS Presentation to make concepts easy.pptx
AWS RDS Presentation to make concepts easy.pptx
bharatkumarbhojwani
 
Publication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdf
Publication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdfPublication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdf
Publication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdf
StatsCommunications
 
Analysis of Billboards hot 100 toop five hit makers on the chart.docx
Analysis of Billboards hot 100 toop five hit makers on the chart.docxAnalysis of Billboards hot 100 toop five hit makers on the chart.docx
Analysis of Billboards hot 100 toop five hit makers on the chart.docx
hershtara1
 
Feature Engineering for Electronic Health Record Systems
Feature Engineering for Electronic Health Record SystemsFeature Engineering for Electronic Health Record Systems
Feature Engineering for Electronic Health Record Systems
Process mining Evangelist
 
文凭证书美国SDSU文凭圣地亚哥州立大学学生证学历认证查询
文凭证书美国SDSU文凭圣地亚哥州立大学学生证学历认证查询文凭证书美国SDSU文凭圣地亚哥州立大学学生证学历认证查询
文凭证书美国SDSU文凭圣地亚哥州立大学学生证学历认证查询
Taqyea
 
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
Taqyea
 
Multi-tenant Data Pipeline Orchestration
Multi-tenant Data Pipeline OrchestrationMulti-tenant Data Pipeline Orchestration
Multi-tenant Data Pipeline Orchestration
Romi Kuntsman
 
problem solving.presentation slideshow bsc nursing
problem solving.presentation slideshow bsc nursingproblem solving.presentation slideshow bsc nursing
problem solving.presentation slideshow bsc nursing
vishnudathas123
 
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdfTOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
NhiV747372
 
Introduction to systems thinking tools_Eng.pdf
Introduction to systems thinking tools_Eng.pdfIntroduction to systems thinking tools_Eng.pdf
Introduction to systems thinking tools_Eng.pdf
AbdurahmanAbd
 
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
bastakwyry
 
RAG Chatbot using AWS Bedrock and Streamlit Framework
RAG Chatbot using AWS Bedrock and Streamlit FrameworkRAG Chatbot using AWS Bedrock and Streamlit Framework
RAG Chatbot using AWS Bedrock and Streamlit Framework
apanneer
 
indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...
indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...
indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...
disnakertransjabarda
 
L1_Slides_Foundational Concepts_508.pptx
L1_Slides_Foundational Concepts_508.pptxL1_Slides_Foundational Concepts_508.pptx
L1_Slides_Foundational Concepts_508.pptx
38NoopurPatel
 
Process Mining as Enabler for Digital Transformations
Process Mining as Enabler for Digital TransformationsProcess Mining as Enabler for Digital Transformations
Process Mining as Enabler for Digital Transformations
Process mining Evangelist
 
AWS-Certified-ML-Engineer-Associate-Slides.pdf
AWS-Certified-ML-Engineer-Associate-Slides.pdfAWS-Certified-ML-Engineer-Associate-Slides.pdf
AWS-Certified-ML-Engineer-Associate-Slides.pdf
philsparkshome
 
50_questions_full.pptxdddddddddddddddddd
50_questions_full.pptxdddddddddddddddddd50_questions_full.pptxdddddddddddddddddd
50_questions_full.pptxdddddddddddddddddd
emir73065
 
Z14_IBM__APL_by_Christian_Demmer_IBM.pdf
Z14_IBM__APL_by_Christian_Demmer_IBM.pdfZ14_IBM__APL_by_Christian_Demmer_IBM.pdf
Z14_IBM__APL_by_Christian_Demmer_IBM.pdf
Fariborz Seyedloo
 
Fundamentals of Data Analysis, its types, tools, algorithms
Fundamentals of Data Analysis, its types, tools, algorithmsFundamentals of Data Analysis, its types, tools, algorithms
Fundamentals of Data Analysis, its types, tools, algorithms
priyaiyerkbcsc
 
AWS RDS Presentation to make concepts easy.pptx
AWS RDS Presentation to make concepts easy.pptxAWS RDS Presentation to make concepts easy.pptx
AWS RDS Presentation to make concepts easy.pptx
bharatkumarbhojwani
 
Publication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdf
Publication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdfPublication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdf
Publication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdf
StatsCommunications
 
Analysis of Billboards hot 100 toop five hit makers on the chart.docx
Analysis of Billboards hot 100 toop five hit makers on the chart.docxAnalysis of Billboards hot 100 toop five hit makers on the chart.docx
Analysis of Billboards hot 100 toop five hit makers on the chart.docx
hershtara1
 
Feature Engineering for Electronic Health Record Systems
Feature Engineering for Electronic Health Record SystemsFeature Engineering for Electronic Health Record Systems
Feature Engineering for Electronic Health Record Systems
Process mining Evangelist
 
文凭证书美国SDSU文凭圣地亚哥州立大学学生证学历认证查询
文凭证书美国SDSU文凭圣地亚哥州立大学学生证学历认证查询文凭证书美国SDSU文凭圣地亚哥州立大学学生证学历认证查询
文凭证书美国SDSU文凭圣地亚哥州立大学学生证学历认证查询
Taqyea
 
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
Taqyea
 
Multi-tenant Data Pipeline Orchestration
Multi-tenant Data Pipeline OrchestrationMulti-tenant Data Pipeline Orchestration
Multi-tenant Data Pipeline Orchestration
Romi Kuntsman
 
problem solving.presentation slideshow bsc nursing
problem solving.presentation slideshow bsc nursingproblem solving.presentation slideshow bsc nursing
problem solving.presentation slideshow bsc nursing
vishnudathas123
 
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdfTOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
NhiV747372
 
Introduction to systems thinking tools_Eng.pdf
Introduction to systems thinking tools_Eng.pdfIntroduction to systems thinking tools_Eng.pdf
Introduction to systems thinking tools_Eng.pdf
AbdurahmanAbd
 
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
bastakwyry
 
RAG Chatbot using AWS Bedrock and Streamlit Framework
RAG Chatbot using AWS Bedrock and Streamlit FrameworkRAG Chatbot using AWS Bedrock and Streamlit Framework
RAG Chatbot using AWS Bedrock and Streamlit Framework
apanneer
 
indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...
indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...
indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...
disnakertransjabarda
 
Ad

Introduction to Apache Spark

  • 1. Introduction to Spark That’s what I do I drink and I join data - Tyrion (Imp. Data Engineer @ Pet a Dragon Inc.)
  • 2. What and Why of Spark?
  • 3. What and Why of Spark?
  • 4. Need: Resilient computation of data For, the computation of data to be resilient. Hadoop’s mapreduce paradigm stores each map output to local disk. And, Each map-reduce task output to HDFS. Spark solves this problem By introducing RDDs (Resilient Distributed Datasets) http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf
  • 5. Easy introduction to RDDs RDDs are like list. RDD[String], RDD[Int] But, ● In-Memory ● Lazy Evaluations (Spark is like an engineering student.) ● Immutable and Read-only ● Cacheable or Persistence ● Partitioned ● Parallel ● Fault Tolerance ● Location Stickiness ● Typed
  • 6. Operations are applied per partition
  • 8. RDDs are lazy engineering students
  • 9. How does RDDs achieve that? That is a story for another time.
  • 15. File formats - Optimised Row Columnar(ORC) Features: ● Row splits(stripes), each split uses column-oriented storage ● Light-weight, always on compression ● Query performance , returns only required fields ● Specifically designed for hive ● Limited schema evolution ● Self describing
  • 16. File formats - Parquet Features: ● Columnar storage ● Supports limited schema evolution ● Self documenting: meta-data is stored with data ● Supports data partitioning ● Query performance, returns only required fields
  • 17. File formats - Avro Features: ● Row based data serialization system ● Schema Evolution ● Self describing data - Schema in JSON and Data in binary ● Splittable ● Sync markers that can be used to split large datasets into subsets

Editor's Notes

  • #3: Typical map-reduce paradigm, these steps are also common to spark. All the results of map is stored on the disk for safety. Bonus point always avoid shuffling. But, the problem is when you have multi stage map-reduce jobs.
  • #4: For chain of map-reduce each subsequent step’s output is stored on the disk.
  • #5: HDFS distributed file system
  • #6: Spark is like an engineering student. Siddarth took a similar talk A group of engineering students whenever asked for assignments, practical records and what not. They just nod along. Planning for each weekend doing nothing. But, at the moment the professor asks at the end of the semester and threatens them that they will lose marks. At, that moment engineering students actually starts working. Lazy evaluation: To trigger the execution, an action is a must. Immutable and Read-only: RDDs are immutable, which means unchangeable over time. That property helps to maintain consistency when we perform further computations. As we can not make any change in RDD once created, it can only get transformed into new RDDs. This is possible through its transformations processes. Cacheable: We can store all the data in persistent storage, memory, and disk. Memory (most preferred) and disk (less Preferred because of its slow access speed). Partitioned: Each dataset is logically partitioned and distributed across nodes over the cluster. They are just partitioned to enhance the processing, Not divided internally. This arrangement of partitions provides parallelism. Fault tolerance: While working on any node, if we lost any RDD itself recovers itself. When we apply different transformations on RDDs, it creates a logical execution plan. The logical execution plan is generally known as lineage graph. As a consequence, we may lose RDD as if any fault arises in the machine. So by applying the same computation on that node of the lineage graph, we can recover our same dataset again. As a matter of fact, this process enhances its property of Fault Tolerance. Location Stickiness: That DAG(Directed Acyclic Graph) scheduler use to place computing partitions on. DAG helps to manage the tasks as much close to the data to operate efficiently. This placing of data also enhances the speed of computations.
  • #7: Small slide with image to drive the point. Partitioning is the key to parallelism.
  • #8: Small slide with image to drive the point. Holds lineage to parent RDD can only be transformed.
  • #9: Small slide with image to drive the point. Just plans doesn’t compute unless the answer is asked for.
  • #11: SOME RANDOM IDEAS. HAVE TO BE TIED TOGETHER IN A HANDS ON Some online spark workbook for them to try out Creating a bare RDD from list, from csv file Parallelizing in RDD sc.parallelize(0 to 9) sc.parallelize(0 to 90 by 10) # Word count example and when Debug string is printed one can see lineage val wordCount = sc.textFile("README.md").flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _) wordCount: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[21] at reduceByKey at <console>:24 scala> wordCount.toDebugString # Number of partitions wordCount.getNumPartitions
  • #12: SOME RANDOM IDEAS. HAVE TO BE TIED TOGETHER IN A HANDS ON Some online spark workbook for them to try out Creating a bare RDD from list, from csv file Parallelizing in RDD sc.parallelize(0 to 9) sc.parallelize(0 to 90 by 10) # Word count example and when Debug string is printed one can see lineage val wordCount = sc.textFile("README.md").flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _) wordCount: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[21] at reduceByKey at <console>:24 scala> wordCount.toDebugString # Number of partitions wordCount.getNumPartitions
  翻译: