SlideShare a Scribd company logo
© Copyright 2017 Pivotal Software, Inc. All rights Reserved.
Mixing Analytic Workloads with Greenplum
and Apache Spark
Kong Yew, Chan
Product Manager
kochan@pivotal.io
Cover w/ Image
Agenda
■ Apache Spark for analytic
workloads
■ Mixing workloads with Greenplum
and Spark
■ Using the Greenplum-Spark
connector
Pivotal Data Suite Use Case
Applied to Predictive
Maintenance
Analytical workloads are changing as
businesses are demanding streaming and
real-time processing
The Data Lake is Valuable, but not a Panacea
• ACID-compliant transactions
• Full ANSI SQL compliance
• Immediate consistency vs eventual consistency
• Hundreds or thousands of concurrent queries
• Queries involving complex, multi-way joins requiring a sophisticated
optimizer
Many operations require the features of mature, relational MPP data platforms
Does Spark Replace the Data Warehouse?
Spark is an in-memory processing system, complements with data warehouse
Reasons:
• In-memory processing
• Memory limitations
• Data Movement
What if we could leverage
the best qualities of the
data warehouse and the
best qualities of Spark?
Why use Apache Spark for processing data ?
Features:
• 100x performance gain with in-memory analytical processing
• SQL for structured data processing
• Advanced analytics for machine learning, graph and streaming
Use Cases:
• Data exploration
• Interactive analytics
• Stream processing
Why use Greenplum for processing data ?
Features:
● Process analytics for entire dataset (in-memory and disks)
● Provide full ANSI SQL for structured data processing
● Advanced analytics for machine learning(Madlib), graph, geospatial, text
Use Cases:
● Large-scale data processing
● Advanced analytics for enterprise use cases
Mixing Analytic Workloads
Best for Greenplum
● Analytics over the entire dataset
● Processing multi-structured data
Best for Spark
● Limited data that fits Spark’s in-
memory platform
● ETL processing (streaming,
micro-batches)
● Data exploration
Pivotal Data Suite Use Case
Applied to Predictive
Maintenance
Using the Greenplum-Spark connector
Use Case: Financial Services
Parallel data
transfer
Financial risk
algorithms
MPP
Database
Use Cases:
● Analyzing financial risk
Benefits:
● Faster in-memory processing
● Expand data processing to Spark
GPDB-Spark
connector
Executor
Greenplum-Spark connector (GSC)
High speed parallel data transfer between GPDB and Spark
● Easy to use
● Optimize for performance
● Complement with Spark ecosystem
In-memory processingMPP database
Greenplum-Spark architecture
● Uses GPDB segments to transfer
data to Spark executors
● Scale dynamically (Kubernetes,
Yarn, Mesos)
● Support Spark programming
languages (Python, Scala, Java, R)
Easy to use
scala> :paste
// Entering paste mode (ctrl-D to finish)
val gscOptionMap = Map(
"url" -> "jdbc:postgresql://gpmaster.domain/tutorial",
"user" -> "user1",
"password" -> "pivotal",
"dbschema" -> "faa",
"dbtable" -> "otp_c",
"partitionColumn" -> "airlineid"
)
val gpdf = spark.read.format("greenplum")
.options(gscOptionMap)
.load()
// Exiting paste mode, now interpreting.
gpdf: org.apache.spark.sql.DataFrame = [flt_year: smallint, flt_quarter: smallint ... 44 more fields]
Performance optimization (Column Projection)
scala> paste:
// Entering paste mode (ctrl-D to finish)
scala> gpdf.select("origincityname", "flt_month", "airlineid", "carrier").show()
control-D
// Exiting paste mode, now interpreting.
+---------------+---------+---------+-------+
| origincityname|flt_month|airlineid|carrier|
+---------------+---------+---------+-------+
| Detroit, MI| 12| 19386| NW|
| Houston, TX| 12| 19704| CO|
| Houston, TX| 12| 19704| CO|
….
+--------------------+---------+---------+-------+
only showing top 20 rows
Performance optimization (Predicate Push down)
scala> paste:
// Entering paste mode (ctrl-D to finish)
scala> gpdf.select("origincityname", "flt_month", "airlineid", "carrier")
.filter("cancelled = 1").filter("flt_month = 12")
.orderBy("airlineid", "origincityname")
.show()
control-D
// Exiting paste mode, now interpreting.
+---------------+---------+---------+-------+
| origincityname|flt_month|airlineid|carrier|
+---------------+---------+---------+-------+
| Detroit, MI| 12| 19386| NW|
| Houston, TX| 12| 19704| CO|
...
+--------------------+---------+---------+-------+
only showing top 20 rows
Benefits of the Greenplum Spark connector
● Faster data transfer between GPDB and Spark
(75x faster than JDBC connector)
● Easy to use
● Performance (Column projection, Predicate push down)
Cover w/ Image
Key Takeaways
● Use mixed workloads for both
Greenplum and Spark
● Leverage both the Greenplum and
Spark ecosystems
Start Your Journey Today!
Pivotal Greenplum and Spark
Connector
pivotal.io/pivotal-greenplum
greenplum-spark.docs.pivotal.io
Pivotal Data Science
pivotal.io/data-science
Apache MADlib
madlib.apache.org
Greenplum Database
Channel
© Copyright 2017 Pivotal Software, Inc. All rights Reserved.
Questions?
Contact kochan@pivotal.io
Thank you for attending!
Ad

More Related Content

What's hot (20)

PostgreSQL Deep Internal
PostgreSQL Deep InternalPostgreSQL Deep Internal
PostgreSQL Deep Internal
EXEM
 
Airflow Best Practises & Roadmap to Airflow 2.0
Airflow Best Practises & Roadmap to Airflow 2.0Airflow Best Practises & Roadmap to Airflow 2.0
Airflow Best Practises & Roadmap to Airflow 2.0
Kaxil Naik
 
The Impala Cookbook
The Impala CookbookThe Impala Cookbook
The Impala Cookbook
Cloudera, Inc.
 
Apache Airflow overview
Apache Airflow overviewApache Airflow overview
Apache Airflow overview
NikolayGrishchenkov
 
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...
Databricks
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
Databricks
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache Kudu
Jeff Holoman
 
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Databricks
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Databricks
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Databricks
 
Airflow presentation
Airflow presentationAirflow presentation
Airflow presentation
Anant Corporation
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Databricks
 
Dependency Injection in Apache Spark Applications
Dependency Injection in Apache Spark ApplicationsDependency Injection in Apache Spark Applications
Dependency Injection in Apache Spark Applications
Databricks
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Databricks
 
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
HostedbyConfluent
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Redo log improvements MYSQL 8.0
Redo log improvements MYSQL 8.0Redo log improvements MYSQL 8.0
Redo log improvements MYSQL 8.0
Mydbops
 
Airbyte @ Airflow Summit - The new modern data stack
Airbyte @ Airflow Summit - The new modern data stackAirbyte @ Airflow Summit - The new modern data stack
Airbyte @ Airflow Summit - The new modern data stack
Michel Tricot
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache Arrow
Wes McKinney
 
3D: DBT using Databricks and Delta
3D: DBT using Databricks and Delta3D: DBT using Databricks and Delta
3D: DBT using Databricks and Delta
Databricks
 
PostgreSQL Deep Internal
PostgreSQL Deep InternalPostgreSQL Deep Internal
PostgreSQL Deep Internal
EXEM
 
Airflow Best Practises & Roadmap to Airflow 2.0
Airflow Best Practises & Roadmap to Airflow 2.0Airflow Best Practises & Roadmap to Airflow 2.0
Airflow Best Practises & Roadmap to Airflow 2.0
Kaxil Naik
 
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...
Databricks
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
Databricks
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache Kudu
Jeff Holoman
 
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Databricks
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Databricks
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Databricks
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Databricks
 
Dependency Injection in Apache Spark Applications
Dependency Injection in Apache Spark ApplicationsDependency Injection in Apache Spark Applications
Dependency Injection in Apache Spark Applications
Databricks
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Databricks
 
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
HostedbyConfluent
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Redo log improvements MYSQL 8.0
Redo log improvements MYSQL 8.0Redo log improvements MYSQL 8.0
Redo log improvements MYSQL 8.0
Mydbops
 
Airbyte @ Airflow Summit - The new modern data stack
Airbyte @ Airflow Summit - The new modern data stackAirbyte @ Airflow Summit - The new modern data stack
Airbyte @ Airflow Summit - The new modern data stack
Michel Tricot
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache Arrow
Wes McKinney
 
3D: DBT using Databricks and Delta
3D: DBT using Databricks and Delta3D: DBT using Databricks and Delta
3D: DBT using Databricks and Delta
Databricks
 

Similar to Mixing Analytic Workloads with Greenplum and Apache Spark (20)

Greenplum-Spark November 2018
Greenplum-Spark November 2018Greenplum-Spark November 2018
Greenplum-Spark November 2018
KongYew Chan, MBA
 
Advancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Advancing GPU Analytics with RAPIDS Accelerator for Spark and AlluxioAdvancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Advancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Alluxio, Inc.
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analytics
inoshg
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and Snappydata
Data Con LA
 
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Anant Corporation
 
Big Telco - Yousun Jeong
Big Telco - Yousun JeongBig Telco - Yousun Jeong
Big Telco - Yousun Jeong
Spark Summit
 
Big Telco Real-Time Network Analytics
Big Telco Real-Time Network AnalyticsBig Telco Real-Time Network Analytics
Big Telco Real-Time Network Analytics
Yousun Jeong
 
Deploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using SparkDeploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using Spark
Jen Aman
 
Pyspark presentationfsfsfjspfsjfsfsfjsfpsfsf
Pyspark presentationfsfsfjspfsjfsfsfjsfpsfsfPyspark presentationfsfsfjspfsjfsfsfjsfpsfsf
Pyspark presentationfsfsfjspfsjfsfsfjsfpsfsf
sasuke20y4sh
 
Stsg17 speaker yousunjeong
Stsg17 speaker yousunjeongStsg17 speaker yousunjeong
Stsg17 speaker yousunjeong
Yousun Jeong
 
Spark Summit EU talk by Ahsan Javed Awan
Spark Summit EU talk by Ahsan Javed AwanSpark Summit EU talk by Ahsan Javed Awan
Spark Summit EU talk by Ahsan Javed Awan
Spark Summit
 
Apache Spark: Usage and Roadmap in Hadoop
Apache Spark: Usage and Roadmap in HadoopApache Spark: Usage and Roadmap in Hadoop
Apache Spark: Usage and Roadmap in Hadoop
Cloudera Japan
 
Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64
Ganesh Raju
 
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterBKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
Linaro
 
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to ClusterBKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
Linaro
 
High performance Spark distribution on PKS by SnappyData
High performance Spark distribution on PKS by SnappyDataHigh performance Spark distribution on PKS by SnappyData
High performance Spark distribution on PKS by SnappyData
Carlos Andrés García
 
High performance Spark distribution on PKS by SnappyData
High performance Spark distribution on PKS by SnappyDataHigh performance Spark distribution on PKS by SnappyData
High performance Spark distribution on PKS by SnappyData
VMware Tanzu
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
Venkata Naga Ravi
 
SnappyData overview NikeTechTalk 11/19/15
SnappyData overview NikeTechTalk 11/19/15SnappyData overview NikeTechTalk 11/19/15
SnappyData overview NikeTechTalk 11/19/15
SnappyData
 
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemWhy Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Cloudera, Inc.
 
Greenplum-Spark November 2018
Greenplum-Spark November 2018Greenplum-Spark November 2018
Greenplum-Spark November 2018
KongYew Chan, MBA
 
Advancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Advancing GPU Analytics with RAPIDS Accelerator for Spark and AlluxioAdvancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Advancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Alluxio, Inc.
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analytics
inoshg
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and Snappydata
Data Con LA
 
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Anant Corporation
 
Big Telco - Yousun Jeong
Big Telco - Yousun JeongBig Telco - Yousun Jeong
Big Telco - Yousun Jeong
Spark Summit
 
Big Telco Real-Time Network Analytics
Big Telco Real-Time Network AnalyticsBig Telco Real-Time Network Analytics
Big Telco Real-Time Network Analytics
Yousun Jeong
 
Deploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using SparkDeploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using Spark
Jen Aman
 
Pyspark presentationfsfsfjspfsjfsfsfjsfpsfsf
Pyspark presentationfsfsfjspfsjfsfsfjsfpsfsfPyspark presentationfsfsfjspfsjfsfsfjsfpsfsf
Pyspark presentationfsfsfjspfsjfsfsfjsfpsfsf
sasuke20y4sh
 
Stsg17 speaker yousunjeong
Stsg17 speaker yousunjeongStsg17 speaker yousunjeong
Stsg17 speaker yousunjeong
Yousun Jeong
 
Spark Summit EU talk by Ahsan Javed Awan
Spark Summit EU talk by Ahsan Javed AwanSpark Summit EU talk by Ahsan Javed Awan
Spark Summit EU talk by Ahsan Javed Awan
Spark Summit
 
Apache Spark: Usage and Roadmap in Hadoop
Apache Spark: Usage and Roadmap in HadoopApache Spark: Usage and Roadmap in Hadoop
Apache Spark: Usage and Roadmap in Hadoop
Cloudera Japan
 
Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64
Ganesh Raju
 
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterBKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
Linaro
 
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to ClusterBKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
Linaro
 
High performance Spark distribution on PKS by SnappyData
High performance Spark distribution on PKS by SnappyDataHigh performance Spark distribution on PKS by SnappyData
High performance Spark distribution on PKS by SnappyData
Carlos Andrés García
 
High performance Spark distribution on PKS by SnappyData
High performance Spark distribution on PKS by SnappyDataHigh performance Spark distribution on PKS by SnappyData
High performance Spark distribution on PKS by SnappyData
VMware Tanzu
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
Venkata Naga Ravi
 
SnappyData overview NikeTechTalk 11/19/15
SnappyData overview NikeTechTalk 11/19/15SnappyData overview NikeTechTalk 11/19/15
SnappyData overview NikeTechTalk 11/19/15
SnappyData
 
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemWhy Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Cloudera, Inc.
 
Ad

More from VMware Tanzu (20)

Spring into AI presented by Dan Vega 5/14
Spring into AI presented by Dan Vega 5/14Spring into AI presented by Dan Vega 5/14
Spring into AI presented by Dan Vega 5/14
VMware Tanzu
 
What AI Means For Your Product Strategy And What To Do About It
What AI Means For Your Product Strategy And What To Do About ItWhat AI Means For Your Product Strategy And What To Do About It
What AI Means For Your Product Strategy And What To Do About It
VMware Tanzu
 
Make the Right Thing the Obvious Thing at Cardinal Health 2023
Make the Right Thing the Obvious Thing at Cardinal Health 2023Make the Right Thing the Obvious Thing at Cardinal Health 2023
Make the Right Thing the Obvious Thing at Cardinal Health 2023
VMware Tanzu
 
Enhancing DevEx and Simplifying Operations at Scale
Enhancing DevEx and Simplifying Operations at ScaleEnhancing DevEx and Simplifying Operations at Scale
Enhancing DevEx and Simplifying Operations at Scale
VMware Tanzu
 
Spring Update | July 2023
Spring Update | July 2023Spring Update | July 2023
Spring Update | July 2023
VMware Tanzu
 
Platforms, Platform Engineering, & Platform as a Product
Platforms, Platform Engineering, & Platform as a ProductPlatforms, Platform Engineering, & Platform as a Product
Platforms, Platform Engineering, & Platform as a Product
VMware Tanzu
 
Building Cloud Ready Apps
Building Cloud Ready AppsBuilding Cloud Ready Apps
Building Cloud Ready Apps
VMware Tanzu
 
Spring Boot 3 And Beyond
Spring Boot 3 And BeyondSpring Boot 3 And Beyond
Spring Boot 3 And Beyond
VMware Tanzu
 
Spring Cloud Gateway - SpringOne Tour 2023 Charles Schwab.pdf
Spring Cloud Gateway - SpringOne Tour 2023 Charles Schwab.pdfSpring Cloud Gateway - SpringOne Tour 2023 Charles Schwab.pdf
Spring Cloud Gateway - SpringOne Tour 2023 Charles Schwab.pdf
VMware Tanzu
 
Simplify and Scale Enterprise Apps in the Cloud | Boston 2023
Simplify and Scale Enterprise Apps in the Cloud | Boston 2023Simplify and Scale Enterprise Apps in the Cloud | Boston 2023
Simplify and Scale Enterprise Apps in the Cloud | Boston 2023
VMware Tanzu
 
Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023
Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023
Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023
VMware Tanzu
 
tanzu_developer_connect.pptx
tanzu_developer_connect.pptxtanzu_developer_connect.pptx
tanzu_developer_connect.pptx
VMware Tanzu
 
Tanzu Virtual Developer Connect Workshop - French
Tanzu Virtual Developer Connect Workshop - FrenchTanzu Virtual Developer Connect Workshop - French
Tanzu Virtual Developer Connect Workshop - French
VMware Tanzu
 
Tanzu Developer Connect Workshop - English
Tanzu Developer Connect Workshop - EnglishTanzu Developer Connect Workshop - English
Tanzu Developer Connect Workshop - English
VMware Tanzu
 
Virtual Developer Connect Workshop - English
Virtual Developer Connect Workshop - EnglishVirtual Developer Connect Workshop - English
Virtual Developer Connect Workshop - English
VMware Tanzu
 
Tanzu Developer Connect - French
Tanzu Developer Connect - FrenchTanzu Developer Connect - French
Tanzu Developer Connect - French
VMware Tanzu
 
Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023
Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023
Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023
VMware Tanzu
 
SpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring Boot
SpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring BootSpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring Boot
SpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring Boot
VMware Tanzu
 
SpringOne Tour: The Influential Software Engineer
SpringOne Tour: The Influential Software EngineerSpringOne Tour: The Influential Software Engineer
SpringOne Tour: The Influential Software Engineer
VMware Tanzu
 
SpringOne Tour: Domain-Driven Design: Theory vs Practice
SpringOne Tour: Domain-Driven Design: Theory vs PracticeSpringOne Tour: Domain-Driven Design: Theory vs Practice
SpringOne Tour: Domain-Driven Design: Theory vs Practice
VMware Tanzu
 
Spring into AI presented by Dan Vega 5/14
Spring into AI presented by Dan Vega 5/14Spring into AI presented by Dan Vega 5/14
Spring into AI presented by Dan Vega 5/14
VMware Tanzu
 
What AI Means For Your Product Strategy And What To Do About It
What AI Means For Your Product Strategy And What To Do About ItWhat AI Means For Your Product Strategy And What To Do About It
What AI Means For Your Product Strategy And What To Do About It
VMware Tanzu
 
Make the Right Thing the Obvious Thing at Cardinal Health 2023
Make the Right Thing the Obvious Thing at Cardinal Health 2023Make the Right Thing the Obvious Thing at Cardinal Health 2023
Make the Right Thing the Obvious Thing at Cardinal Health 2023
VMware Tanzu
 
Enhancing DevEx and Simplifying Operations at Scale
Enhancing DevEx and Simplifying Operations at ScaleEnhancing DevEx and Simplifying Operations at Scale
Enhancing DevEx and Simplifying Operations at Scale
VMware Tanzu
 
Spring Update | July 2023
Spring Update | July 2023Spring Update | July 2023
Spring Update | July 2023
VMware Tanzu
 
Platforms, Platform Engineering, & Platform as a Product
Platforms, Platform Engineering, & Platform as a ProductPlatforms, Platform Engineering, & Platform as a Product
Platforms, Platform Engineering, & Platform as a Product
VMware Tanzu
 
Building Cloud Ready Apps
Building Cloud Ready AppsBuilding Cloud Ready Apps
Building Cloud Ready Apps
VMware Tanzu
 
Spring Boot 3 And Beyond
Spring Boot 3 And BeyondSpring Boot 3 And Beyond
Spring Boot 3 And Beyond
VMware Tanzu
 
Spring Cloud Gateway - SpringOne Tour 2023 Charles Schwab.pdf
Spring Cloud Gateway - SpringOne Tour 2023 Charles Schwab.pdfSpring Cloud Gateway - SpringOne Tour 2023 Charles Schwab.pdf
Spring Cloud Gateway - SpringOne Tour 2023 Charles Schwab.pdf
VMware Tanzu
 
Simplify and Scale Enterprise Apps in the Cloud | Boston 2023
Simplify and Scale Enterprise Apps in the Cloud | Boston 2023Simplify and Scale Enterprise Apps in the Cloud | Boston 2023
Simplify and Scale Enterprise Apps in the Cloud | Boston 2023
VMware Tanzu
 
Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023
Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023
Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023
VMware Tanzu
 
tanzu_developer_connect.pptx
tanzu_developer_connect.pptxtanzu_developer_connect.pptx
tanzu_developer_connect.pptx
VMware Tanzu
 
Tanzu Virtual Developer Connect Workshop - French
Tanzu Virtual Developer Connect Workshop - FrenchTanzu Virtual Developer Connect Workshop - French
Tanzu Virtual Developer Connect Workshop - French
VMware Tanzu
 
Tanzu Developer Connect Workshop - English
Tanzu Developer Connect Workshop - EnglishTanzu Developer Connect Workshop - English
Tanzu Developer Connect Workshop - English
VMware Tanzu
 
Virtual Developer Connect Workshop - English
Virtual Developer Connect Workshop - EnglishVirtual Developer Connect Workshop - English
Virtual Developer Connect Workshop - English
VMware Tanzu
 
Tanzu Developer Connect - French
Tanzu Developer Connect - FrenchTanzu Developer Connect - French
Tanzu Developer Connect - French
VMware Tanzu
 
Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023
Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023
Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023
VMware Tanzu
 
SpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring Boot
SpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring BootSpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring Boot
SpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring Boot
VMware Tanzu
 
SpringOne Tour: The Influential Software Engineer
SpringOne Tour: The Influential Software EngineerSpringOne Tour: The Influential Software Engineer
SpringOne Tour: The Influential Software Engineer
VMware Tanzu
 
SpringOne Tour: Domain-Driven Design: Theory vs Practice
SpringOne Tour: Domain-Driven Design: Theory vs PracticeSpringOne Tour: Domain-Driven Design: Theory vs Practice
SpringOne Tour: Domain-Driven Design: Theory vs Practice
VMware Tanzu
 
Ad

Recently uploaded (20)

IT488 Wireless Sensor Networks_Information Technology
IT488 Wireless Sensor Networks_Information TechnologyIT488 Wireless Sensor Networks_Information Technology
IT488 Wireless Sensor Networks_Information Technology
SHEHABALYAMANI
 
Dark Dynamism: drones, dark factories and deurbanization
Dark Dynamism: drones, dark factories and deurbanizationDark Dynamism: drones, dark factories and deurbanization
Dark Dynamism: drones, dark factories and deurbanization
Jakub Šimek
 
Master Data Management - Enterprise Application Integration
Master Data Management - Enterprise Application IntegrationMaster Data Management - Enterprise Application Integration
Master Data Management - Enterprise Application Integration
Sherif Rasmy
 
Cybersecurity Tools and Technologies - Microsoft Certificate
Cybersecurity Tools and Technologies - Microsoft CertificateCybersecurity Tools and Technologies - Microsoft Certificate
Cybersecurity Tools and Technologies - Microsoft Certificate
VICTOR MAESTRE RAMIREZ
 
Digital Technologies for Culture, Arts and Heritage: Insights from Interdisci...
Digital Technologies for Culture, Arts and Heritage: Insights from Interdisci...Digital Technologies for Culture, Arts and Heritage: Insights from Interdisci...
Digital Technologies for Culture, Arts and Heritage: Insights from Interdisci...
Vasileios Komianos
 
MULTI-STAKEHOLDER CONSULTATION PROGRAM On Implementation of DNF 2.0 and Way F...
MULTI-STAKEHOLDER CONSULTATION PROGRAM On Implementation of DNF 2.0 and Way F...MULTI-STAKEHOLDER CONSULTATION PROGRAM On Implementation of DNF 2.0 and Way F...
MULTI-STAKEHOLDER CONSULTATION PROGRAM On Implementation of DNF 2.0 and Way F...
ICT Frame Magazine Pvt. Ltd.
 
React Native for Business Solutions: Building Scalable Apps for Success
React Native for Business Solutions: Building Scalable Apps for SuccessReact Native for Business Solutions: Building Scalable Apps for Success
React Native for Business Solutions: Building Scalable Apps for Success
Amelia Swank
 
May Patch Tuesday
May Patch TuesdayMay Patch Tuesday
May Patch Tuesday
Ivanti
 
Mastering Testing in the Modern F&B Landscape
Mastering Testing in the Modern F&B LandscapeMastering Testing in the Modern F&B Landscape
Mastering Testing in the Modern F&B Landscape
marketing943205
 
Integrating FME with Python: Tips, Demos, and Best Practices for Powerful Aut...
Integrating FME with Python: Tips, Demos, and Best Practices for Powerful Aut...Integrating FME with Python: Tips, Demos, and Best Practices for Powerful Aut...
Integrating FME with Python: Tips, Demos, and Best Practices for Powerful Aut...
Safe Software
 
accessibility Considerations during Design by Rick Blair, Schneider Electric
accessibility Considerations during Design by Rick Blair, Schneider Electricaccessibility Considerations during Design by Rick Blair, Schneider Electric
accessibility Considerations during Design by Rick Blair, Schneider Electric
UXPA Boston
 
DevOpsDays SLC - Platform Engineers are Product Managers.pptx
DevOpsDays SLC - Platform Engineers are Product Managers.pptxDevOpsDays SLC - Platform Engineers are Product Managers.pptx
DevOpsDays SLC - Platform Engineers are Product Managers.pptx
Justin Reock
 
Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...
Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...
Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...
Maarten Verwaest
 
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
Lorenzo Miniero
 
Top 5 Qualities to Look for in Salesforce Partners in 2025
Top 5 Qualities to Look for in Salesforce Partners in 2025Top 5 Qualities to Look for in Salesforce Partners in 2025
Top 5 Qualities to Look for in Salesforce Partners in 2025
Damco Salesforce Services
 
How Top Companies Benefit from Outsourcing
How Top Companies Benefit from OutsourcingHow Top Companies Benefit from Outsourcing
How Top Companies Benefit from Outsourcing
Nascenture
 
Artificial_Intelligence_in_Everyday_Life.pptx
Artificial_Intelligence_in_Everyday_Life.pptxArtificial_Intelligence_in_Everyday_Life.pptx
Artificial_Intelligence_in_Everyday_Life.pptx
03ANMOLCHAURASIYA
 
Cybersecurity Threat Vectors and Mitigation
Cybersecurity Threat Vectors and MitigationCybersecurity Threat Vectors and Mitigation
Cybersecurity Threat Vectors and Mitigation
VICTOR MAESTRE RAMIREZ
 
In-App Guidance_ Save Enterprises Millions in Training & IT Costs.pptx
In-App Guidance_ Save Enterprises Millions in Training & IT Costs.pptxIn-App Guidance_ Save Enterprises Millions in Training & IT Costs.pptx
In-App Guidance_ Save Enterprises Millions in Training & IT Costs.pptx
aptyai
 
UiPath AgentHack - Build the AI agents of tomorrow_Enablement 1.pptx
UiPath AgentHack - Build the AI agents of tomorrow_Enablement 1.pptxUiPath AgentHack - Build the AI agents of tomorrow_Enablement 1.pptx
UiPath AgentHack - Build the AI agents of tomorrow_Enablement 1.pptx
anabulhac
 
IT488 Wireless Sensor Networks_Information Technology
IT488 Wireless Sensor Networks_Information TechnologyIT488 Wireless Sensor Networks_Information Technology
IT488 Wireless Sensor Networks_Information Technology
SHEHABALYAMANI
 
Dark Dynamism: drones, dark factories and deurbanization
Dark Dynamism: drones, dark factories and deurbanizationDark Dynamism: drones, dark factories and deurbanization
Dark Dynamism: drones, dark factories and deurbanization
Jakub Šimek
 
Master Data Management - Enterprise Application Integration
Master Data Management - Enterprise Application IntegrationMaster Data Management - Enterprise Application Integration
Master Data Management - Enterprise Application Integration
Sherif Rasmy
 
Cybersecurity Tools and Technologies - Microsoft Certificate
Cybersecurity Tools and Technologies - Microsoft CertificateCybersecurity Tools and Technologies - Microsoft Certificate
Cybersecurity Tools and Technologies - Microsoft Certificate
VICTOR MAESTRE RAMIREZ
 
Digital Technologies for Culture, Arts and Heritage: Insights from Interdisci...
Digital Technologies for Culture, Arts and Heritage: Insights from Interdisci...Digital Technologies for Culture, Arts and Heritage: Insights from Interdisci...
Digital Technologies for Culture, Arts and Heritage: Insights from Interdisci...
Vasileios Komianos
 
MULTI-STAKEHOLDER CONSULTATION PROGRAM On Implementation of DNF 2.0 and Way F...
MULTI-STAKEHOLDER CONSULTATION PROGRAM On Implementation of DNF 2.0 and Way F...MULTI-STAKEHOLDER CONSULTATION PROGRAM On Implementation of DNF 2.0 and Way F...
MULTI-STAKEHOLDER CONSULTATION PROGRAM On Implementation of DNF 2.0 and Way F...
ICT Frame Magazine Pvt. Ltd.
 
React Native for Business Solutions: Building Scalable Apps for Success
React Native for Business Solutions: Building Scalable Apps for SuccessReact Native for Business Solutions: Building Scalable Apps for Success
React Native for Business Solutions: Building Scalable Apps for Success
Amelia Swank
 
May Patch Tuesday
May Patch TuesdayMay Patch Tuesday
May Patch Tuesday
Ivanti
 
Mastering Testing in the Modern F&B Landscape
Mastering Testing in the Modern F&B LandscapeMastering Testing in the Modern F&B Landscape
Mastering Testing in the Modern F&B Landscape
marketing943205
 
Integrating FME with Python: Tips, Demos, and Best Practices for Powerful Aut...
Integrating FME with Python: Tips, Demos, and Best Practices for Powerful Aut...Integrating FME with Python: Tips, Demos, and Best Practices for Powerful Aut...
Integrating FME with Python: Tips, Demos, and Best Practices for Powerful Aut...
Safe Software
 
accessibility Considerations during Design by Rick Blair, Schneider Electric
accessibility Considerations during Design by Rick Blair, Schneider Electricaccessibility Considerations during Design by Rick Blair, Schneider Electric
accessibility Considerations during Design by Rick Blair, Schneider Electric
UXPA Boston
 
DevOpsDays SLC - Platform Engineers are Product Managers.pptx
DevOpsDays SLC - Platform Engineers are Product Managers.pptxDevOpsDays SLC - Platform Engineers are Product Managers.pptx
DevOpsDays SLC - Platform Engineers are Product Managers.pptx
Justin Reock
 
Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...
Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...
Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...
Maarten Verwaest
 
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
Lorenzo Miniero
 
Top 5 Qualities to Look for in Salesforce Partners in 2025
Top 5 Qualities to Look for in Salesforce Partners in 2025Top 5 Qualities to Look for in Salesforce Partners in 2025
Top 5 Qualities to Look for in Salesforce Partners in 2025
Damco Salesforce Services
 
How Top Companies Benefit from Outsourcing
How Top Companies Benefit from OutsourcingHow Top Companies Benefit from Outsourcing
How Top Companies Benefit from Outsourcing
Nascenture
 
Artificial_Intelligence_in_Everyday_Life.pptx
Artificial_Intelligence_in_Everyday_Life.pptxArtificial_Intelligence_in_Everyday_Life.pptx
Artificial_Intelligence_in_Everyday_Life.pptx
03ANMOLCHAURASIYA
 
Cybersecurity Threat Vectors and Mitigation
Cybersecurity Threat Vectors and MitigationCybersecurity Threat Vectors and Mitigation
Cybersecurity Threat Vectors and Mitigation
VICTOR MAESTRE RAMIREZ
 
In-App Guidance_ Save Enterprises Millions in Training & IT Costs.pptx
In-App Guidance_ Save Enterprises Millions in Training & IT Costs.pptxIn-App Guidance_ Save Enterprises Millions in Training & IT Costs.pptx
In-App Guidance_ Save Enterprises Millions in Training & IT Costs.pptx
aptyai
 
UiPath AgentHack - Build the AI agents of tomorrow_Enablement 1.pptx
UiPath AgentHack - Build the AI agents of tomorrow_Enablement 1.pptxUiPath AgentHack - Build the AI agents of tomorrow_Enablement 1.pptx
UiPath AgentHack - Build the AI agents of tomorrow_Enablement 1.pptx
anabulhac
 

Mixing Analytic Workloads with Greenplum and Apache Spark

  • 1. © Copyright 2017 Pivotal Software, Inc. All rights Reserved. Mixing Analytic Workloads with Greenplum and Apache Spark Kong Yew, Chan Product Manager kochan@pivotal.io
  • 2. Cover w/ Image Agenda ■ Apache Spark for analytic workloads ■ Mixing workloads with Greenplum and Spark ■ Using the Greenplum-Spark connector
  • 3. Pivotal Data Suite Use Case Applied to Predictive Maintenance Analytical workloads are changing as businesses are demanding streaming and real-time processing
  • 4. The Data Lake is Valuable, but not a Panacea • ACID-compliant transactions • Full ANSI SQL compliance • Immediate consistency vs eventual consistency • Hundreds or thousands of concurrent queries • Queries involving complex, multi-way joins requiring a sophisticated optimizer Many operations require the features of mature, relational MPP data platforms
  • 5. Does Spark Replace the Data Warehouse? Spark is an in-memory processing system, complements with data warehouse Reasons: • In-memory processing • Memory limitations • Data Movement
  • 6. What if we could leverage the best qualities of the data warehouse and the best qualities of Spark?
  • 7. Why use Apache Spark for processing data ? Features: • 100x performance gain with in-memory analytical processing • SQL for structured data processing • Advanced analytics for machine learning, graph and streaming Use Cases: • Data exploration • Interactive analytics • Stream processing
  • 8. Why use Greenplum for processing data ? Features: ● Process analytics for entire dataset (in-memory and disks) ● Provide full ANSI SQL for structured data processing ● Advanced analytics for machine learning(Madlib), graph, geospatial, text Use Cases: ● Large-scale data processing ● Advanced analytics for enterprise use cases
  • 9. Mixing Analytic Workloads Best for Greenplum ● Analytics over the entire dataset ● Processing multi-structured data Best for Spark ● Limited data that fits Spark’s in- memory platform ● ETL processing (streaming, micro-batches) ● Data exploration
  • 10. Pivotal Data Suite Use Case Applied to Predictive Maintenance Using the Greenplum-Spark connector
  • 11. Use Case: Financial Services Parallel data transfer Financial risk algorithms MPP Database Use Cases: ● Analyzing financial risk Benefits: ● Faster in-memory processing ● Expand data processing to Spark GPDB-Spark connector Executor
  • 12. Greenplum-Spark connector (GSC) High speed parallel data transfer between GPDB and Spark ● Easy to use ● Optimize for performance ● Complement with Spark ecosystem In-memory processingMPP database
  • 13. Greenplum-Spark architecture ● Uses GPDB segments to transfer data to Spark executors ● Scale dynamically (Kubernetes, Yarn, Mesos) ● Support Spark programming languages (Python, Scala, Java, R)
  • 14. Easy to use scala> :paste // Entering paste mode (ctrl-D to finish) val gscOptionMap = Map( "url" -> "jdbc:postgresql://gpmaster.domain/tutorial", "user" -> "user1", "password" -> "pivotal", "dbschema" -> "faa", "dbtable" -> "otp_c", "partitionColumn" -> "airlineid" ) val gpdf = spark.read.format("greenplum") .options(gscOptionMap) .load() // Exiting paste mode, now interpreting. gpdf: org.apache.spark.sql.DataFrame = [flt_year: smallint, flt_quarter: smallint ... 44 more fields]
  • 15. Performance optimization (Column Projection) scala> paste: // Entering paste mode (ctrl-D to finish) scala> gpdf.select("origincityname", "flt_month", "airlineid", "carrier").show() control-D // Exiting paste mode, now interpreting. +---------------+---------+---------+-------+ | origincityname|flt_month|airlineid|carrier| +---------------+---------+---------+-------+ | Detroit, MI| 12| 19386| NW| | Houston, TX| 12| 19704| CO| | Houston, TX| 12| 19704| CO| …. +--------------------+---------+---------+-------+ only showing top 20 rows
  • 16. Performance optimization (Predicate Push down) scala> paste: // Entering paste mode (ctrl-D to finish) scala> gpdf.select("origincityname", "flt_month", "airlineid", "carrier") .filter("cancelled = 1").filter("flt_month = 12") .orderBy("airlineid", "origincityname") .show() control-D // Exiting paste mode, now interpreting. +---------------+---------+---------+-------+ | origincityname|flt_month|airlineid|carrier| +---------------+---------+---------+-------+ | Detroit, MI| 12| 19386| NW| | Houston, TX| 12| 19704| CO| ... +--------------------+---------+---------+-------+ only showing top 20 rows
  • 17. Benefits of the Greenplum Spark connector ● Faster data transfer between GPDB and Spark (75x faster than JDBC connector) ● Easy to use ● Performance (Column projection, Predicate push down)
  • 18. Cover w/ Image Key Takeaways ● Use mixed workloads for both Greenplum and Spark ● Leverage both the Greenplum and Spark ecosystems
  • 19. Start Your Journey Today! Pivotal Greenplum and Spark Connector pivotal.io/pivotal-greenplum greenplum-spark.docs.pivotal.io Pivotal Data Science pivotal.io/data-science Apache MADlib madlib.apache.org Greenplum Database Channel
  • 20. © Copyright 2017 Pivotal Software, Inc. All rights Reserved. Questions? Contact kochan@pivotal.io Thank you for attending!
  翻译: