RDD – Overview 
(Resilient Distributed Datasets*) 
Nov 1st 2014 
Oakland CA 
By Taposh Dutta Roy 
* Source: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
Contents 
• What is RDD 
• Motivation Behind RDD 
• Use Cases for RDD 
• Challenges for RDD 
• RDD: Solve
What is RDD 
“RDDs are fault-tolerant, parallel data structures 
that let users explicitly persist intermediate 
results in memory, control their partitioning to 
optimize data placement, and manipulate them 
using a rich set of operators.” 
In a nutshell, RDDs are an abstraction 
that enables efficient data reuse in a broad 
range of applications.
Motivation behind RDD 
Current frameworks like MapReduce and 
Dryad provide numerous abstractions for 
accessing a cluster’s computational resources 
but lack abstractions for leveraging 
distributed memory! 
Data reuse is common in many iterative 
machine learning and graph algorithms, such as 
PageRank, K-means clustering, and logistic 
regression.
Motivation behind RDD 
Another use case is when a user runs 
multiple ad hoc queries on the same subset of 
data. 
Unfortunately, in current frameworks the 
only way to reuse data between 
computations (i.e., between two jobs) is to write 
it to an external storage system, e.g., a 
distributed file system or an object store such as Amazon S3.
Use cases for RDD 
1. Solving interactive problems (repeated ad hoc queries) 
Existing solution - slow, needs heavy I/O 
RDD - fast, in memory
Use cases for RDD 
Example: Suppose I have to scan 
web server access logs for an 
error code or a certain piece of text.
Use cases for RDD 
Example (cont’d): I run a MapReduce job for this on a cluster. 
It returns a set of files containing the lines that 
matched the search, shuts down the cluster, and puts 
the files into the Amazon S3 location specified in 
the script. 
If we then look at the result files and need to 
extract some other text from them, we must 
write or reuse another MapReduce job. 
That job takes extra time to fetch the files again, 
process them, and return the results.
Use cases for RDD 
RDD solves this problem by keeping the data 
in memory and giving the user the ability 
to re-query the subset.
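The re-query idea can be sketched in a few lines of plain Python. This is a toy stand-in and not Spark's API: the sample log lines, the `load_logs` helper, and the variable names are made up for illustration. The filtered subset is computed once, held in memory, and then queried again without going back to external storage.

```python
# Toy illustration of re-querying a cached subset (plain Python, not Spark).
# The log lines and helper names below are hypothetical.

LOAD_COUNT = 0  # counts how many times the "expensive" load runs


def load_logs():
    """Stand-in for reading access logs from external storage."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    return [
        "200 GET /index.html",
        "500 GET /cart php timeout",
        "404 GET /missing.png",
        "500 POST /checkout mysql connection refused",
        "200 GET /about.html",
    ]


# First query: load once, keep only server errors, and hold them in memory.
errors = [line for line in load_logs() if line.startswith("500")]

# Follow-up queries reuse the in-memory subset; no reload is needed.
mysql_errors = [line for line in errors if "mysql" in line]
timeout_errors = [line for line in errors if "timeout" in line]

print(len(errors), len(mysql_errors), len(timeout_errors), LOAD_COUNT)
```

In the MapReduce workflow described earlier, each follow-up query would instead be a new job that re-reads its input from storage.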
Use cases for RDD 
2. Solving Iterative Problems 
The second use case is iterative 
algorithms such as logistic regression, which 
reuse the same data on every iteration.
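To show why an algorithm like logistic regression benefits from data that stays in memory, here is a minimal sketch in plain Python. It is not Spark code: the six training points, the learning rate, and the iteration count are arbitrary assumptions. The thing to notice is that every pass of the loop reads the same `data` list again.

```python
import math

# Hypothetical 1-D training set: (feature, label) with label in {0, 1}.
points = [(-2.0, 0), (-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1), (2.0, 1)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(data, iterations=200, learning_rate=0.5):
    """Batch gradient descent; every iteration re-reads the same `data`."""
    w, b = 0.0, 0.0
    for _ in range(iterations):
        # Gradient of the average log loss with respect to w and b.
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in data)
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in data)
        w -= learning_rate * grad_w / len(data)
        b -= learning_rate * grad_b / len(data)
    return w, b


w, b = train(points)
predictions = [1 if sigmoid(w * x + b) >= 0.5 else 0 for x, _ in points]
print(predictions)
```

With 200 iterations the data set is scanned 200 times. If each scan had to reload the data from external storage, I/O would dominate the run time, which is the cost that in-memory reuse avoids.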
Challenges for RDD 
The main challenge in designing RDDs is 
defining a programming interface that can 
provide fault tolerance efficiently.
Challenges for RDD 
Existing solutions such as distributed shared 
memory, key-value stores, and databases offer 
an interface based on fine-grained updates. 
With such systems, the only way to get 
fault tolerance is to replicate the data across 
machines or to log updates across machines. Both 
approaches are expensive for data-intensive 
workloads: they need high bandwidth to copy the data over 
the cluster network, and they need a lot of storage.
RDD: Solve 
RDD solves these problems by providing an 
interface based on coarse-grained 
transformations such as map, filter, and join. 
These transformations apply the same 
operation to many data items. 
This allows RDDs to provide fault 
tolerance efficiently by logging the transformations 
used to build a dataset (i.e., its lineage) rather 
than the actual data. If a partition of an RDD is lost, 
the RDD has enough information about how 
it was derived from other RDDs to 
recompute just that partition. The lost data 
can be recovered quickly, without costly 
replication.
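Lineage-based recovery can be illustrated with a small toy model in plain Python. This is a sketch under stated assumptions and not how Spark is implemented: the `ToyRDD` class, its `compute_log` field, and the sample numbers are hypothetical. Each derived dataset stores only its parent and the transformation applied, so a lost partition can be rebuilt on demand.

```python
# Toy model of lineage-based recovery (plain Python, not Spark's internals).
# ToyRDD and all of its members are hypothetical names for illustration.


class ToyRDD:
    def __init__(self, partitions=None, parent=None, transform=None):
        self.source = partitions    # base data only (stands in for stable storage)
        self.parent = parent        # lineage: the dataset this one was derived from
        self.transform = transform  # lineage: function applied to each partition
        self.num_partitions = (
            len(partitions) if parent is None else parent.num_partitions
        )
        self.cache = {}             # partition index -> materialized list
        self.compute_log = []       # partition indexes that were (re)computed

    def map(self, fn):
        return ToyRDD(parent=self, transform=lambda part: [fn(x) for x in part])

    def filter(self, pred):
        return ToyRDD(
            parent=self, transform=lambda part: [x for x in part if pred(x)]
        )

    def compute(self, index):
        """Return one partition, rebuilding it from lineage if it is missing."""
        if self.parent is None:
            return list(self.source[index])
        if index not in self.cache:
            self.compute_log.append(index)
            self.cache[index] = self.transform(self.parent.compute(index))
        return self.cache[index]

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.compute(i)]


base = ToyRDD(partitions=[[1, 2, 3], [4, 5, 6]])
even_squares = base.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

before = even_squares.collect()

# Simulate losing partition 1 of the derived dataset.
del even_squares.cache[1]

# Only the lost partition is rebuilt, using the recorded transformation.
recovered = even_squares.compute(1)
print(before, recovered, even_squares.compute_log)
```

The log of computed partitions shows that partition 0 is never touched again after the simulated failure. Only partition 1 is recomputed, and no second copy of the data was ever stored.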
Applications not suitable for RDD 
RDDs are less suitable for applications 
that make asynchronous fine-grained updates 
to shared state, such as a storage system for a 
web application or an incremental web 
crawler. For such applications, it is more efficient 
to use systems that perform traditional 
update logging and data checkpointing, 
such as databases.
Conclusion 
The goal of RDDs is to provide an efficient 
programming model for batch 
analytics. 
RDDs have been implemented in a system 
called Spark.
Ad

More Related Content

What's hot (20)

Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
Vadim Y. Bichutskiy
 
Hadoop technology
Hadoop technologyHadoop technology
Hadoop technology
tipanagiriharika
 
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Simplilearn
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
Walaa Hamdy Assy
 
Mining Data Streams
Mining Data StreamsMining Data Streams
Mining Data Streams
SujaAldrin
 
Distributed Database System
Distributed Database SystemDistributed Database System
Distributed Database System
Sulemang
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
EMC
 
Map reduce presentation
Map reduce presentationMap reduce presentation
Map reduce presentation
ateeq ateeq
 
Hadoop Architecture
Hadoop ArchitectureHadoop Architecture
Hadoop Architecture
Dr. C.V. Suresh Babu
 
NoSQL databases - An introduction
NoSQL databases - An introductionNoSQL databases - An introduction
NoSQL databases - An introduction
Pooyan Mehrparvar
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
Prashant Gupta
 
Nosql data models
Nosql data modelsNosql data models
Nosql data models
Viet-Trung TRAN
 
Introduction to HDFS
Introduction to HDFSIntroduction to HDFS
Introduction to HDFS
Bhavesh Padharia
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Anastasios Skarlatidis
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
jeykottalam
 
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Simplilearn
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
Mostafa
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
Simplilearn
 
introduction to NOSQL Database
introduction to NOSQL Databaseintroduction to NOSQL Database
introduction to NOSQL Database
nehabsairam
 
Sqoop
SqoopSqoop
Sqoop
Prashant Gupta
 
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Simplilearn
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
Walaa Hamdy Assy
 
Mining Data Streams
Mining Data StreamsMining Data Streams
Mining Data Streams
SujaAldrin
 
Distributed Database System
Distributed Database SystemDistributed Database System
Distributed Database System
Sulemang
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
EMC
 
Map reduce presentation
Map reduce presentationMap reduce presentation
Map reduce presentation
ateeq ateeq
 
NoSQL databases - An introduction
NoSQL databases - An introductionNoSQL databases - An introduction
NoSQL databases - An introduction
Pooyan Mehrparvar
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
jeykottalam
 
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Simplilearn
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
Mostafa
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
Simplilearn
 
introduction to NOSQL Database
introduction to NOSQL Databaseintroduction to NOSQL Database
introduction to NOSQL Database
nehabsairam
 

Viewers also liked (20)

Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
Dean Chen
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
Databricks
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Sachin Aggarwal
 
Resilient Distributed Datasets
Resilient Distributed DatasetsResilient Distributed Datasets
Resilient Distributed Datasets
Gabriele Modena
 
Simple Log Analysis and Trending
Simple Log Analysis and TrendingSimple Log Analysis and Trending
Simple Log Analysis and Trending
Mike Brittain
 
Anatomy of Data Source API : A deep dive into Spark Data source API
Anatomy of Data Source API : A deep dive into Spark Data source APIAnatomy of Data Source API : A deep dive into Spark Data source API
Anatomy of Data Source API : A deep dive into Spark Data source API
datamantra
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
Anton Kirillov
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
Pietro Michiardi
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Databricks
 
Resilient Distributed Datasets
Resilient Distributed DatasetsResilient Distributed Datasets
Resilient Distributed Datasets
Alessandro Menabò
 
BDAS RDD study report v1.2
BDAS RDD study report v1.2BDAS RDD study report v1.2
BDAS RDD study report v1.2
Stefanie Zhao
 
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
 East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
Chris Fregly
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
Robert Sanders
 
Writing your own RDD for fun and profit
Writing your own RDD for fun and profitWriting your own RDD for fun and profit
Writing your own RDD for fun and profit
Pawel Szulc
 
Transformations and actions a visual guide training
Transformations and actions a visual guide trainingTransformations and actions a visual guide training
Transformations and actions a visual guide training
Spark Summit
 
IBM Spark Meetup - RDD & Spark Basics
IBM Spark Meetup - RDD & Spark BasicsIBM Spark Meetup - RDD & Spark Basics
IBM Spark Meetup - RDD & Spark Basics
Satya Narayan
 
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst OptimizerDeep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Sachin Aggarwal
 
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
BigDataEverywhere
 
Spark RDD : Transformations & Actions
Spark RDD : Transformations & ActionsSpark RDD : Transformations & Actions
Spark RDD : Transformations & Actions
MICHRAFY MUSTAFA
 
หนังสือภาษาไทย Spark Internal
หนังสือภาษาไทย Spark Internalหนังสือภาษาไทย Spark Internal
หนังสือภาษาไทย Spark Internal
Bhuridech Sudsee
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
Dean Chen
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
Databricks
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Sachin Aggarwal
 
Resilient Distributed Datasets
Resilient Distributed DatasetsResilient Distributed Datasets
Resilient Distributed Datasets
Gabriele Modena
 
Simple Log Analysis and Trending
Simple Log Analysis and TrendingSimple Log Analysis and Trending
Simple Log Analysis and Trending
Mike Brittain
 
Anatomy of Data Source API : A deep dive into Spark Data source API
Anatomy of Data Source API : A deep dive into Spark Data source APIAnatomy of Data Source API : A deep dive into Spark Data source API
Anatomy of Data Source API : A deep dive into Spark Data source API
datamantra
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
Anton Kirillov
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
Pietro Michiardi
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Databricks
 
Resilient Distributed Datasets
Resilient Distributed DatasetsResilient Distributed Datasets
Resilient Distributed Datasets
Alessandro Menabò
 
BDAS RDD study report v1.2
BDAS RDD study report v1.2BDAS RDD study report v1.2
BDAS RDD study report v1.2
Stefanie Zhao
 
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
 East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
Chris Fregly
 
Writing your own RDD for fun and profit
Writing your own RDD for fun and profitWriting your own RDD for fun and profit
Writing your own RDD for fun and profit
Pawel Szulc
 
Transformations and actions a visual guide training
Transformations and actions a visual guide trainingTransformations and actions a visual guide training
Transformations and actions a visual guide training
Spark Summit
 
IBM Spark Meetup - RDD & Spark Basics
IBM Spark Meetup - RDD & Spark BasicsIBM Spark Meetup - RDD & Spark Basics
IBM Spark Meetup - RDD & Spark Basics
Satya Narayan
 
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst OptimizerDeep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Sachin Aggarwal
 
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
BigDataEverywhere
 
Spark RDD : Transformations & Actions
Spark RDD : Transformations & ActionsSpark RDD : Transformations & Actions
Spark RDD : Transformations & Actions
MICHRAFY MUSTAFA
 
หนังสือภาษาไทย Spark Internal
หนังสือภาษาไทย Spark Internalหนังสือภาษาไทย Spark Internal
หนังสือภาษาไทย Spark Internal
Bhuridech Sudsee
 
Ad

Similar to Resilient Distributed DataSets - Apache SPARK (20)

Spark cluster computing with working sets
Spark cluster computing with working setsSpark cluster computing with working sets
Spark cluster computing with working sets
JinxinTang
 
A big-data architecture for real-time analytics
A big-data architecture for real-time analyticsA big-data architecture for real-time analytics
A big-data architecture for real-time analytics
ramikaurraminder
 
Big Data: RDBMS vs. Hadoop vs. Spark
Big Data: RDBMS vs. Hadoop vs. SparkBig Data: RDBMS vs. Hadoop vs. Spark
Big Data: RDBMS vs. Hadoop vs. Spark
Graisy Biswal
 
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015 Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Sameer Farooqui
 
Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive
huguk
 
Study Notes: Apache Spark
Study Notes: Apache SparkStudy Notes: Apache Spark
Study Notes: Apache Spark
Gao Yunzhong
 
Iaetsd mapreduce streaming over cassandra datasets
Iaetsd mapreduce streaming over cassandra datasetsIaetsd mapreduce streaming over cassandra datasets
Iaetsd mapreduce streaming over cassandra datasets
Iaetsd Iaetsd
 
Cloud Computing Ambiance using Secluded Access Control Method
Cloud Computing Ambiance using Secluded Access Control MethodCloud Computing Ambiance using Secluded Access Control Method
Cloud Computing Ambiance using Secluded Access Control Method
IRJET Journal
 
Database
DatabaseDatabase
Database
Zahid Soomro
 
Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System
cscpconf
 
Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce
cscpconf
 
Redis vs Memcached
Redis vs MemcachedRedis vs Memcached
Redis vs Memcached
Gaurav Agrawal
 
Hadoop by kamran khan
Hadoop by kamran khanHadoop by kamran khan
Hadoop by kamran khan
KamranKhan587
 
MAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATION
MAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATIONMAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATION
MAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATION
IJDMS
 
Map Reduce based on Cloak DHT Data Replication Evaluation
Map Reduce based on Cloak DHT Data Replication EvaluationMap Reduce based on Cloak DHT Data Replication Evaluation
Map Reduce based on Cloak DHT Data Replication Evaluation
IJDMS
 
Map Reduce based on Cloak DHT Data Replication Evaluation
Map Reduce based on Cloak DHT Data Replication EvaluationMap Reduce based on Cloak DHT Data Replication Evaluation
Map Reduce based on Cloak DHT Data Replication Evaluation
IJDMS
 
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...
Samsung Business USA
 
Hadoop versus RDBMS - Comparing the two data paradigms
Hadoop versus RDBMS - Comparing the two data paradigmsHadoop versus RDBMS - Comparing the two data paradigms
Hadoop versus RDBMS - Comparing the two data paradigms
Jigisha Aryya
 
Harnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution TimesHarnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution Times
David Tjahjono,MD,MBA(UK)
 
D04501036040
D04501036040D04501036040
D04501036040
ijceronline
 
Spark cluster computing with working sets
Spark cluster computing with working setsSpark cluster computing with working sets
Spark cluster computing with working sets
JinxinTang
 
A big-data architecture for real-time analytics
A big-data architecture for real-time analyticsA big-data architecture for real-time analytics
A big-data architecture for real-time analytics
ramikaurraminder
 
Big Data: RDBMS vs. Hadoop vs. Spark
Big Data: RDBMS vs. Hadoop vs. SparkBig Data: RDBMS vs. Hadoop vs. Spark
Big Data: RDBMS vs. Hadoop vs. Spark
Graisy Biswal
 
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015 Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Sameer Farooqui
 
Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive
huguk
 
Study Notes: Apache Spark
Study Notes: Apache SparkStudy Notes: Apache Spark
Study Notes: Apache Spark
Gao Yunzhong
 
Iaetsd mapreduce streaming over cassandra datasets
Iaetsd mapreduce streaming over cassandra datasetsIaetsd mapreduce streaming over cassandra datasets
Iaetsd mapreduce streaming over cassandra datasets
Iaetsd Iaetsd
 
Cloud Computing Ambiance using Secluded Access Control Method
Cloud Computing Ambiance using Secluded Access Control MethodCloud Computing Ambiance using Secluded Access Control Method
Cloud Computing Ambiance using Secluded Access Control Method
IRJET Journal
 
Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System
cscpconf
 
Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce
cscpconf
 
Hadoop by kamran khan
Hadoop by kamran khanHadoop by kamran khan
Hadoop by kamran khan
KamranKhan587
 
MAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATION
MAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATIONMAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATION
MAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATION
IJDMS
 
Map Reduce based on Cloak DHT Data Replication Evaluation
Map Reduce based on Cloak DHT Data Replication EvaluationMap Reduce based on Cloak DHT Data Replication Evaluation
Map Reduce based on Cloak DHT Data Replication Evaluation
IJDMS
 
Map Reduce based on Cloak DHT Data Replication Evaluation
Map Reduce based on Cloak DHT Data Replication EvaluationMap Reduce based on Cloak DHT Data Replication Evaluation
Map Reduce based on Cloak DHT Data Replication Evaluation
IJDMS
 
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...
Samsung Business USA
 
Hadoop versus RDBMS - Comparing the two data paradigms
Hadoop versus RDBMS - Comparing the two data paradigmsHadoop versus RDBMS - Comparing the two data paradigms
Hadoop versus RDBMS - Comparing the two data paradigms
Jigisha Aryya
 
Harnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution TimesHarnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution Times
David Tjahjono,MD,MBA(UK)
 
Ad

More from Taposh Roy (20)

Image annotation - Segmentation & Annotation
Image annotation - Segmentation & AnnotationImage annotation - Segmentation & Annotation
Image annotation - Segmentation & Annotation
Taposh Roy
 
Wal mart health_care_2017_dec
Wal mart health_care_2017_decWal mart health_care_2017_dec
Wal mart health_care_2017_dec
Taposh Roy
 
Predictive modeling healthcare
Predictive modeling healthcarePredictive modeling healthcare
Predictive modeling healthcare
Taposh Roy
 
Basic elements-of-strategy-framework
Basic elements-of-strategy-frameworkBasic elements-of-strategy-framework
Basic elements-of-strategy-framework
Taposh Roy
 
Kaggle bikeshare Competition - Part 1
Kaggle bikeshare Competition  - Part 1Kaggle bikeshare Competition  - Part 1
Kaggle bikeshare Competition - Part 1
Taposh Roy
 
Airline industry analysis - Boeing & Airbus
Airline industry analysis - Boeing & AirbusAirline industry analysis - Boeing & Airbus
Airline industry analysis - Boeing & Airbus
Taposh Roy
 
Energy industry report
Energy industry reportEnergy industry report
Energy industry report
Taposh Roy
 
Consumer electronics bm_retail
Consumer electronics bm_retailConsumer electronics bm_retail
Consumer electronics bm_retail
Taposh Roy
 
Multi Asset Endowment Investment Strategy
Multi Asset Endowment Investment StrategyMulti Asset Endowment Investment Strategy
Multi Asset Endowment Investment Strategy
Taposh Roy
 
Competitor Analysis for RSG Consulting
Competitor Analysis for RSG ConsultingCompetitor Analysis for RSG Consulting
Competitor Analysis for RSG Consulting
Taposh Roy
 
Financial Analysis boeing airbus
Financial Analysis boeing airbusFinancial Analysis boeing airbus
Financial Analysis boeing airbus
Taposh Roy
 
Sprint softbank (Merger Analysis)
Sprint softbank (Merger Analysis)Sprint softbank (Merger Analysis)
Sprint softbank (Merger Analysis)
Taposh Roy
 
M a analysis_roche_genentech
M a analysis_roche_genentechM a analysis_roche_genentech
M a analysis_roche_genentech
Taposh Roy
 
Land rover north america (HBS 9-596036)
Land rover north america (HBS 9-596036)Land rover north america (HBS 9-596036)
Land rover north america (HBS 9-596036)
Taposh Roy
 
American airlines - Value Pricing 1992
American airlines - Value Pricing 1992American airlines - Value Pricing 1992
American airlines - Value Pricing 1992
Taposh Roy
 
Strategy frameworks-and-models
Strategy frameworks-and-modelsStrategy frameworks-and-models
Strategy frameworks-and-models
Taposh Roy
 
Tesla in UAE (Financial Strategy)
Tesla in UAE (Financial Strategy)Tesla in UAE (Financial Strategy)
Tesla in UAE (Financial Strategy)
Taposh Roy
 
Understandingplatform
UnderstandingplatformUnderstandingplatform
Understandingplatform
Taposh Roy
 
Disney hbs9 701-035
Disney hbs9 701-035Disney hbs9 701-035
Disney hbs9 701-035
Taposh Roy
 
Best buy-analysis
Best buy-analysisBest buy-analysis
Best buy-analysis
Taposh Roy
 
Image annotation - Segmentation & Annotation
Image annotation - Segmentation & AnnotationImage annotation - Segmentation & Annotation
Image annotation - Segmentation & Annotation
Taposh Roy
 
Wal mart health_care_2017_dec
Wal mart health_care_2017_decWal mart health_care_2017_dec
Wal mart health_care_2017_dec
Taposh Roy
 
Predictive modeling healthcare
Predictive modeling healthcarePredictive modeling healthcare
Predictive modeling healthcare
Taposh Roy
 
Basic elements-of-strategy-framework
Basic elements-of-strategy-frameworkBasic elements-of-strategy-framework
Basic elements-of-strategy-framework
Taposh Roy
 
Kaggle bikeshare Competition - Part 1
Kaggle bikeshare Competition  - Part 1Kaggle bikeshare Competition  - Part 1
Kaggle bikeshare Competition - Part 1
Taposh Roy
 
Airline industry analysis - Boeing & Airbus
Airline industry analysis - Boeing & AirbusAirline industry analysis - Boeing & Airbus
Airline industry analysis - Boeing & Airbus
Taposh Roy
 
Energy industry report
Energy industry reportEnergy industry report
Energy industry report
Taposh Roy
 
Consumer electronics bm_retail
Consumer electronics bm_retailConsumer electronics bm_retail
Consumer electronics bm_retail
Taposh Roy
 
Multi Asset Endowment Investment Strategy
Multi Asset Endowment Investment StrategyMulti Asset Endowment Investment Strategy
Multi Asset Endowment Investment Strategy
Taposh Roy
 
Competitor Analysis for RSG Consulting
Competitor Analysis for RSG ConsultingCompetitor Analysis for RSG Consulting
Competitor Analysis for RSG Consulting
Taposh Roy
 
Financial Analysis boeing airbus
Financial Analysis boeing airbusFinancial Analysis boeing airbus
Financial Analysis boeing airbus
Taposh Roy
 
Sprint softbank (Merger Analysis)
Sprint softbank (Merger Analysis)Sprint softbank (Merger Analysis)
Sprint softbank (Merger Analysis)
Taposh Roy
 
M a analysis_roche_genentech
M a analysis_roche_genentechM a analysis_roche_genentech
M a analysis_roche_genentech
Taposh Roy
 
Land rover north america (HBS 9-596036)
Land rover north america (HBS 9-596036)Land rover north america (HBS 9-596036)
Land rover north america (HBS 9-596036)
Taposh Roy
 
American airlines - Value Pricing 1992
American airlines - Value Pricing 1992American airlines - Value Pricing 1992
American airlines - Value Pricing 1992
Taposh Roy
 
Strategy frameworks-and-models
Strategy frameworks-and-modelsStrategy frameworks-and-models
Strategy frameworks-and-models
Taposh Roy
 
Tesla in UAE (Financial Strategy)
Tesla in UAE (Financial Strategy)Tesla in UAE (Financial Strategy)
Tesla in UAE (Financial Strategy)
Taposh Roy
 
Understandingplatform
UnderstandingplatformUnderstandingplatform
Understandingplatform
Taposh Roy
 
Disney hbs9 701-035
Disney hbs9 701-035Disney hbs9 701-035
Disney hbs9 701-035
Taposh Roy
 
Best buy-analysis
Best buy-analysisBest buy-analysis
Best buy-analysis
Taposh Roy
 

Resilient Distributed DataSets - Apache SPARK

  • 1. RDD – Overview (Resilient Distributed Datasets*) Nov 1st 2014, Oakland CA. By Taposh Dutta Roy. * Source: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
  • 2. Contents • What is RDD • Motivation Behind RDD • Use Cases for RDD • Challenges for RDD • RDD: Solve
  • 3. What is RDD “RDDs are fault tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operations.” In a nutshell, RDDs are a level of abstraction that enables efficient data reuse in a broad range of applications.
  • 4. Motivation behind RDD Current frameworks like MapReduce & Dryad provide numerous abstractions for accessing a cluster’s computational resources but lack abstractions for leveraging distributed memory. Data reuse is common in many iterative machine learning algorithms such as PageRank, K-means Clustering & Logistic Regression.
  • 5. Motivation behind RDD Another use case is when a user runs multiple ad hoc queries on the same subset of data. Unfortunately, in current frameworks the only way to reuse data between computations, i.e. between two jobs, is to write it to an external storage system, e.g. a distributed file system such as Amazon S3.
  • 6. Use cases for RDD 1. Solving Iterative problems Existing Solution – Slow, needs high I/O RDD - Fast, in memory
  • 7. Use cases for RDD Example: Suppose I have to look at the webserver access logs and look for an error_code or certain text.
  • 8. Use cases for RDD Example (cont’d): I run the above code on a server, which returns a set of files containing the grepped matches, shuts down the cluster, and writes the files to an Amazon S3 location specified in the script. If we then look at the result files and need to extract some other text from them, we must write or reuse another set of MapReduce code. This may take extra time to fetch the files, process them, and provide the results.
  • 9. Use cases for RDD RDD solves this problem by storing the data in memory and giving the user the ability to re-query the subset.
  • 10. Use cases for RDD 2. Solving Interactive Problems The second use case is interactive algorithms, such as logistic regression, that need the data to be reused.
  • 11. Challenge for RDD The main challenge in designing RDD is defining a programming interface that can provide fault tolerance efficiently.
  • 12. Challenges for RDD Existing solutions such as distributed shared memory, key-value stores, & databases offer an interface based on fine-grained updates. With such systems, the only way to get fault tolerance is to replicate the data across machines or to log updates across machines. Both of these approaches are data intensive: they need high bandwidth to move the data over the cluster network, and large storage.
  • 13. RDD: Solve RDD solves these problems by providing an interface based on coarse-grained transformations such as map, filter and join. These transformations apply the same operation to many data items. This allows RDDs to provide fault tolerance efficiently by logging the transformations used to build a dataset (i.e. its lineage) rather than the actual data. If a partition of an RDD is lost, the RDD has enough information about how it ..
  • 14. RDD: Solve (Cont’d) was derived from other RDDs to recompute just that partition. The lost data can be recovered quickly, without costly replication.
  • 15. Applications not suitable : RDD RDDs would be less suitable for applications that make asynchronous fine-grained updates to shared state, such as a storage system for a web application or an incremental web crawler. For such applications, systems that perform traditional update logging and data checkpointing, such as databases, are more suitable.
  • 16. Conclusion RDD RDD's goal is to provide an efficient programming model for batch analytics. RDD has been implemented in a system called Spark.
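
The lineage idea from slides 13 and 14 can be illustrated with a small sketch in plain Python. This is not Spark's API: the class `ToyRDD` and its methods (`from_partitions`, `lose_partition`, and so on) are hypothetical names made up for this example, and the log lines are invented sample data. The sketch assumes each dataset remembers only the coarse-grained transformation that produced it, keeps computed partitions in memory for reuse, and rebuilds a lost partition from its parent instead of restoring a replica.

```python
from typing import Any, Callable, Dict, List, Optional


class ToyRDD:
    """Hypothetical toy stand-in for an RDD (illustration only, not Spark)."""

    def __init__(self, num_partitions: int,
                 compute: Callable[[int], List[Any]],
                 parent: Optional["ToyRDD"] = None) -> None:
        self.num_partitions = num_partitions
        self._compute = compute          # lineage: how to build partition i
        self.parent = parent             # the dataset this one was derived from
        self._cache: Dict[int, List[Any]] = {}  # partitions held in memory

    @classmethod
    def from_partitions(cls, partitions: List[List[Any]]) -> "ToyRDD":
        # The input list stands in for stable storage that can be re-read.
        return cls(len(partitions), lambda i: list(partitions[i]))

    def map(self, fn: Callable[[Any], Any]) -> "ToyRDD":
        # Coarse-grained: the same function is applied to every item.
        return ToyRDD(self.num_partitions,
                      lambda i: [fn(x) for x in self.partition(i)],
                      parent=self)

    def filter(self, pred: Callable[[Any], bool]) -> "ToyRDD":
        return ToyRDD(self.num_partitions,
                      lambda i: [x for x in self.partition(i) if pred(x)],
                      parent=self)

    def partition(self, i: int) -> List[Any]:
        # Reuse the in-memory copy if present; otherwise recompute from lineage.
        if i not in self._cache:
            self._cache[i] = self._compute(i)
        return self._cache[i]

    def lose_partition(self, i: int) -> None:
        # Simulate a node failure that drops one in-memory partition.
        self._cache.pop(i, None)

    def collect(self) -> List[Any]:
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


# Invented sample access-log lines, split across two partitions.
logs = ToyRDD.from_partitions([
    ["GET /a 200", "GET /b 500"],
    ["GET /c 404", "GET /d 500"],
])
errors = logs.filter(lambda line: line.endswith("500"))
paths = errors.map(lambda line: line.split()[1])

before = paths.collect()
paths.lose_partition(1)     # drop one partition of the derived dataset
after = paths.collect()     # only partition 1 is rebuilt, from its parent
```

In this sketch, a second query over `errors` reuses the partitions already held in memory rather than re-reading the source, and a lost partition costs one recomputation of that partition rather than a replicated copy of the whole dataset.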