SlideShare a Scribd company logo
Apache Flink Tutorial
DataSet API
Who Am I?
● Newbie in Apache Flink
● BlueWell Technology
○ Big Data Architect
○ Focuses
■ Open DC/OS
■ CoreOS
■ Kubernetes
■ Apache Flink
■ Data Science
Agenda
● Apache Flink Type System
○ Atomic
○ Composite
○ Tuple
○ POJO
○ Scala case class
● Transformations
○ Transformations on DataSet
○ Rich Functions
○ Accumulators & Counters
○ Annotations
Agenda
● Iterations
○ Bulk iterations
○ Delta iterations
Basic Structure Of Apache Flink Programs
● For each Apache Flink Program, the basic structure is listed as follows.
○ Obtain an execution environment.
■ ExecutionEnvironment.getExecutionEnvironment()
○ Load/create DataSets from data sources.
■ readFile(), readTextFile(), readCsvFile(), etc.
■ fromElements(), fromCollection(), etc.
○ Execute some transformations on the DataSets.
■ filter(), map(), reduce(), etc.
○ Specify where to save results of the computations.
■ write(), writeAsCsv(), etc.
■ collect(), print(), etc.
○ Trigger the program execution.
Hands-on
BasicStructure
Apache Flink Type System
● Flink attempts to support all data types
○ Facilitate programming
○ Seamlessly integrate with legacy code
● Flink analyzes applications before execution for
○ Identifying used data types
○ Determining serializers and comparators
● Data types could be
○ Atomic data types
○ Composite data types
Composite Data Types
● Composite Data Types include
○ Tuples
■ In Java
■ In Scala
○ POJOs
○ Scala case class
Tuple Data Types
● Flink supports Tuple in
○ Java: org.apache.flink.api.java.tuple.Tuple<n>
■ n = 1, …, 25
○ Scala: premitive Tuple<n>
■ n = 1, …, 22
● Key declarations
○ Field index
○ E.g., dataset.groupBy(0).sum(1)
○ E.g., dataset.groupBy(“_1”).sum(“_2”)
POJOs
● POJOs – A java class with
○ A default constructor without parameters.
○ All fields are
■ public or
■ private but have getter & setter
■ Ex.
public class Car {
public int id;
public String brand;
public Car() {};
public Car(int id, String brand) {…};
}
POJOs
● Key Declarations
○ Field name as a string
○ E.g., cars.groupBy(“brand”)
Scala Case Class
● Primitive Scala case classes are supported
○ E.g., case class Car(id: Int, brand: String)
● Key declarations
○ Field name as a string
■ E.g., cars.groupBy(“brand”)
○ Field name
■ E.g., cars.groupBy(_.brand)
Hands-on
TypeSystem
DataSet Transformations
Hands-on
Transformation
Introduction To Rich Functions
● Purpose
○ Implement complicated user-defined functions
○ Broadcast variables
○ Access to Accumulators & Counters
● Structure
○ open(): initialization, one-shot setup work.
○ close(): tear-down, clean-up work.
○ get/setRuntimeContext(): access to RuntimeContext.
○ corresponding transformation method, e.g., map, join, etc.
Broadcast Variables
● Register
○ dataset.map(new RichMapFunction())
.withBroadcastSet(toBroadcast, “varName”)
● Access in Rich Functions
○ Initialize the broadcasted variables in open() via
○ getRuntimeContext().getBroadcastVariable(“varName”)
○ Access them in the whole class scope.
Accumulators & Counters
● Purpose
○ Debugging
○ First glance of DataSets
● Counters are kinds of accumulator
● Structure
○ An add operation
○ A final accumulated result (available after the job ended)
● Flink will automatically sum up all partial results.
Accumulators & Counters
● Built-in Accumulators
○ IntCounter, LongCounter, DoubleCounter
○ Histogram: map from integer to integer, distributions
● Register
○ new IntCounter()
○ getRuntimeContext().addAccumulator(“accuName”, counter)
● Access
○ In Rich Functions
■ getRuntimeContext().getAccumulator(“accuName”)
○ In the end of job
■ JobExecutionResult.getAccumulatorResult(“accuName")
Semantic Annotations
● Give Flink hints about the behavior of a function
○ A powerful means to speed up execution
○ Reusing sort orders or partitions across multiple operations
○ Prevent programs from unnecessary data shuffling or unnecessary sorts
● Types of Annotation
○ Forwarded fields annotations (@ForwardedFields)
○ Non-forwarded fields annotations (@NonForwardedFields)
■ Black or White in place
○ Read fields annotations (@ReadFields)
○ Fields to be read and evaluated
Iterations
● Bulk iterations
○ Partial Solution
○ Iteration Result
Iterations
● Delta Iterations
○ Workset / Update Solution Set
○ Iteration Result
The End
Thanks!!
Ad

More Related Content

What's hot (20)

Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy FarkasVirtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Flink Forward
 
InfluxDB & Grafana
InfluxDB & GrafanaInfluxDB & Grafana
InfluxDB & Grafana
Pedro Salgado
 
A Brief Introduction of TiDB (Percona Live)
A Brief Introduction of TiDB (Percona Live)A Brief Introduction of TiDB (Percona Live)
A Brief Introduction of TiDB (Percona Live)
PingCAP
 
Introduction to InfluxDB, an Open Source Distributed Time Series Database by ...
Introduction to InfluxDB, an Open Source Distributed Time Series Database by ...Introduction to InfluxDB, an Open Source Distributed Time Series Database by ...
Introduction to InfluxDB, an Open Source Distributed Time Series Database by ...
Hakka Labs
 
Tick
TickTick
Tick
Vincenzo Ferrari
 
Modern Java Features
Modern Java Features Modern Java Features
Modern Java Features
Florian Hopf
 
Data Engineer’s Lunch #41: PygramETL
Data Engineer’s Lunch #41: PygramETLData Engineer’s Lunch #41: PygramETL
Data Engineer’s Lunch #41: PygramETL
Anant Corporation
 
InfluxDB and Grafana: An Introduction to Time-Based Data Storage and Visualiz...
InfluxDB and Grafana: An Introduction to Time-Based Data Storage and Visualiz...InfluxDB and Grafana: An Introduction to Time-Based Data Storage and Visualiz...
InfluxDB and Grafana: An Introduction to Time-Based Data Storage and Visualiz...
Caner Ünal
 
Flink Forward SF 2017: Stephan Ewen - Experiences running Flink at Very Large...
Flink Forward SF 2017: Stephan Ewen - Experiences running Flink at Very Large...Flink Forward SF 2017: Stephan Ewen - Experiences running Flink at Very Large...
Flink Forward SF 2017: Stephan Ewen - Experiences running Flink at Very Large...
Flink Forward
 
Flink Forward SF 2017: Timo Walther - Table & SQL API – unified APIs for bat...
Flink Forward SF 2017: Timo Walther -  Table & SQL API – unified APIs for bat...Flink Forward SF 2017: Timo Walther -  Table & SQL API – unified APIs for bat...
Flink Forward SF 2017: Timo Walther - Table & SQL API – unified APIs for bat...
Flink Forward
 
What Reika Taught us
What Reika Taught usWhat Reika Taught us
What Reika Taught us
University of California, Santa Cruz
 
Flink Forward SF 2017: Kenneth Knowles - Back to Sessions overview
Flink Forward SF 2017: Kenneth Knowles - Back to Sessions overviewFlink Forward SF 2017: Kenneth Knowles - Back to Sessions overview
Flink Forward SF 2017: Kenneth Knowles - Back to Sessions overview
Flink Forward
 
Measure your app internals with InfluxDB and Symfony2
Measure your app internals with InfluxDB and Symfony2Measure your app internals with InfluxDB and Symfony2
Measure your app internals with InfluxDB and Symfony2
Corley S.r.l.
 
Javantura v3 - ELK – Big Data for DevOps – Maarten Mulders
Javantura v3 - ELK – Big Data for DevOps – Maarten MuldersJavantura v3 - ELK – Big Data for DevOps – Maarten Mulders
Javantura v3 - ELK – Big Data for DevOps – Maarten Mulders
HUJAK - Hrvatska udruga Java korisnika / Croatian Java User Association
 
Streaming data to s3 using akka streams
Streaming data to s3 using akka streamsStreaming data to s3 using akka streams
Streaming data to s3 using akka streams
Mikhail Girkin
 
Scale Relational Database with NewSQL
Scale Relational Database with NewSQLScale Relational Database with NewSQL
Scale Relational Database with NewSQL
PingCAP
 
Scalable real-time processing techniques
Scalable real-time processing techniquesScalable real-time processing techniques
Scalable real-time processing techniques
Lars Albertsson
 
Building an Observability platform with ClickHouse
Building an Observability platform with ClickHouseBuilding an Observability platform with ClickHouse
Building an Observability platform with ClickHouse
Altinity Ltd
 
Monitoring in a scalable world
Monitoring in a scalable worldMonitoring in a scalable world
Monitoring in a scalable world
TechExeter
 
Building a transactional key-value store that scales to 100+ nodes (percona l...
Building a transactional key-value store that scales to 100+ nodes (percona l...Building a transactional key-value store that scales to 100+ nodes (percona l...
Building a transactional key-value store that scales to 100+ nodes (percona l...
PingCAP
 
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy FarkasVirtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Flink Forward
 
A Brief Introduction of TiDB (Percona Live)
A Brief Introduction of TiDB (Percona Live)A Brief Introduction of TiDB (Percona Live)
A Brief Introduction of TiDB (Percona Live)
PingCAP
 
Introduction to InfluxDB, an Open Source Distributed Time Series Database by ...
Introduction to InfluxDB, an Open Source Distributed Time Series Database by ...Introduction to InfluxDB, an Open Source Distributed Time Series Database by ...
Introduction to InfluxDB, an Open Source Distributed Time Series Database by ...
Hakka Labs
 
Modern Java Features
Modern Java Features Modern Java Features
Modern Java Features
Florian Hopf
 
Data Engineer’s Lunch #41: PygramETL
Data Engineer’s Lunch #41: PygramETLData Engineer’s Lunch #41: PygramETL
Data Engineer’s Lunch #41: PygramETL
Anant Corporation
 
InfluxDB and Grafana: An Introduction to Time-Based Data Storage and Visualiz...
InfluxDB and Grafana: An Introduction to Time-Based Data Storage and Visualiz...InfluxDB and Grafana: An Introduction to Time-Based Data Storage and Visualiz...
InfluxDB and Grafana: An Introduction to Time-Based Data Storage and Visualiz...
Caner Ünal
 
Flink Forward SF 2017: Stephan Ewen - Experiences running Flink at Very Large...
Flink Forward SF 2017: Stephan Ewen - Experiences running Flink at Very Large...Flink Forward SF 2017: Stephan Ewen - Experiences running Flink at Very Large...
Flink Forward SF 2017: Stephan Ewen - Experiences running Flink at Very Large...
Flink Forward
 
Flink Forward SF 2017: Timo Walther - Table & SQL API – unified APIs for bat...
Flink Forward SF 2017: Timo Walther -  Table & SQL API – unified APIs for bat...Flink Forward SF 2017: Timo Walther -  Table & SQL API – unified APIs for bat...
Flink Forward SF 2017: Timo Walther - Table & SQL API – unified APIs for bat...
Flink Forward
 
Flink Forward SF 2017: Kenneth Knowles - Back to Sessions overview
Flink Forward SF 2017: Kenneth Knowles - Back to Sessions overviewFlink Forward SF 2017: Kenneth Knowles - Back to Sessions overview
Flink Forward SF 2017: Kenneth Knowles - Back to Sessions overview
Flink Forward
 
Measure your app internals with InfluxDB and Symfony2
Measure your app internals with InfluxDB and Symfony2Measure your app internals with InfluxDB and Symfony2
Measure your app internals with InfluxDB and Symfony2
Corley S.r.l.
 
Streaming data to s3 using akka streams
Streaming data to s3 using akka streamsStreaming data to s3 using akka streams
Streaming data to s3 using akka streams
Mikhail Girkin
 
Scale Relational Database with NewSQL
Scale Relational Database with NewSQLScale Relational Database with NewSQL
Scale Relational Database with NewSQL
PingCAP
 
Scalable real-time processing techniques
Scalable real-time processing techniquesScalable real-time processing techniques
Scalable real-time processing techniques
Lars Albertsson
 
Building an Observability platform with ClickHouse
Building an Observability platform with ClickHouseBuilding an Observability platform with ClickHouse
Building an Observability platform with ClickHouse
Altinity Ltd
 
Monitoring in a scalable world
Monitoring in a scalable worldMonitoring in a scalable world
Monitoring in a scalable world
TechExeter
 
Building a transactional key-value store that scales to 100+ nodes (percona l...
Building a transactional key-value store that scales to 100+ nodes (percona l...Building a transactional key-value store that scales to 100+ nodes (percona l...
Building a transactional key-value store that scales to 100+ nodes (percona l...
PingCAP
 

Viewers also liked (15)

Apache Software Foundation: How To Contribute, with Apache Flink as Example (...
Apache Software Foundation: How To Contribute, with Apache Flink as Example (...Apache Software Foundation: How To Contribute, with Apache Flink as Example (...
Apache Software Foundation: How To Contribute, with Apache Flink as Example (...
Apache Flink Taiwan User Group
 
Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)
Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)
Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)
Apache Flink Taiwan User Group
 
Apache Flink Training Workshop @ HadoopCon2016 - #4 Advanced Stream Processing
Apache Flink Training Workshop @ HadoopCon2016 - #4 Advanced Stream ProcessingApache Flink Training Workshop @ HadoopCon2016 - #4 Advanced Stream Processing
Apache Flink Training Workshop @ HadoopCon2016 - #4 Advanced Stream Processing
Apache Flink Taiwan User Group
 
Yarn Resource Management Using Machine Learning
Yarn Resource Management Using Machine LearningYarn Resource Management Using Machine Learning
Yarn Resource Management Using Machine Learning
ojavajava
 
Hadoop con 2016_9_10_王經篤(Jing-Doo Wang)
Hadoop con 2016_9_10_王經篤(Jing-Doo Wang)Hadoop con 2016_9_10_王經篤(Jing-Doo Wang)
Hadoop con 2016_9_10_王經篤(Jing-Doo Wang)
Jing-Doo Wang
 
How to plan a hadoop cluster for testing and production environment
How to plan a hadoop cluster for testing and production environmentHow to plan a hadoop cluster for testing and production environment
How to plan a hadoop cluster for testing and production environment
Anna Yen
 
2016-07-12 Introduction to Big Data Platform Security
2016-07-12 Introduction to Big Data Platform Security2016-07-12 Introduction to Big Data Platform Security
2016-07-12 Introduction to Big Data Platform Security
Jazz Yao-Tsung Wang
 
2016 Hadoop Conf TW - 如何建置數據精靈
2016 Hadoop Conf TW - 如何建置數據精靈2016 Hadoop Conf TW - 如何建置數據精靈
2016 Hadoop Conf TW - 如何建置數據精靈
晨揚 施
 
Flink Apachecon Presentation
Flink Apachecon PresentationFlink Apachecon Presentation
Flink Apachecon Presentation
Gyula Fóra
 
HadoopCon 2016 - 用 Jupyter Notebook Hold 住一個上線 Spark Machine Learning 專案實戰
HadoopCon 2016  - 用 Jupyter Notebook Hold 住一個上線 Spark  Machine Learning 專案實戰HadoopCon 2016  - 用 Jupyter Notebook Hold 住一個上線 Spark  Machine Learning 專案實戰
HadoopCon 2016 - 用 Jupyter Notebook Hold 住一個上線 Spark Machine Learning 專案實戰
Wayne Chen
 
BI in Xuenn
BI in XuennBI in Xuenn
BI in Xuenn
Len Chang
 
HadoopCon'16, Taipei @myui
HadoopCon'16, Taipei @myuiHadoopCon'16, Taipei @myui
HadoopCon'16, Taipei @myui
Makoto Yui
 
Achieve big data analytic platform with lambda architecture on cloud
Achieve big data analytic platform with lambda architecture on cloudAchieve big data analytic platform with lambda architecture on cloud
Achieve big data analytic platform with lambda architecture on cloud
Scott Miao
 
SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)
wqchen
 
Hadoop con2016 - Implement Real-time Centralized logging System by Elastic Stack
Hadoop con2016 - Implement Real-time Centralized logging System by Elastic StackHadoop con2016 - Implement Real-time Centralized logging System by Elastic Stack
Hadoop con2016 - Implement Real-time Centralized logging System by Elastic Stack
Len Chang
 
Apache Software Foundation: How To Contribute, with Apache Flink as Example (...
Apache Software Foundation: How To Contribute, with Apache Flink as Example (...Apache Software Foundation: How To Contribute, with Apache Flink as Example (...
Apache Software Foundation: How To Contribute, with Apache Flink as Example (...
Apache Flink Taiwan User Group
 
Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)
Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)
Stream Processing with Apache Flink (Flink.tw Meetup 2016/07/19)
Apache Flink Taiwan User Group
 
Apache Flink Training Workshop @ HadoopCon2016 - #4 Advanced Stream Processing
Apache Flink Training Workshop @ HadoopCon2016 - #4 Advanced Stream ProcessingApache Flink Training Workshop @ HadoopCon2016 - #4 Advanced Stream Processing
Apache Flink Training Workshop @ HadoopCon2016 - #4 Advanced Stream Processing
Apache Flink Taiwan User Group
 
Yarn Resource Management Using Machine Learning
Yarn Resource Management Using Machine LearningYarn Resource Management Using Machine Learning
Yarn Resource Management Using Machine Learning
ojavajava
 
Hadoop con 2016_9_10_王經篤(Jing-Doo Wang)
Hadoop con 2016_9_10_王經篤(Jing-Doo Wang)Hadoop con 2016_9_10_王經篤(Jing-Doo Wang)
Hadoop con 2016_9_10_王經篤(Jing-Doo Wang)
Jing-Doo Wang
 
How to plan a hadoop cluster for testing and production environment
How to plan a hadoop cluster for testing and production environmentHow to plan a hadoop cluster for testing and production environment
How to plan a hadoop cluster for testing and production environment
Anna Yen
 
2016-07-12 Introduction to Big Data Platform Security
2016-07-12 Introduction to Big Data Platform Security2016-07-12 Introduction to Big Data Platform Security
2016-07-12 Introduction to Big Data Platform Security
Jazz Yao-Tsung Wang
 
2016 Hadoop Conf TW - 如何建置數據精靈
2016 Hadoop Conf TW - 如何建置數據精靈2016 Hadoop Conf TW - 如何建置數據精靈
2016 Hadoop Conf TW - 如何建置數據精靈
晨揚 施
 
Flink Apachecon Presentation
Flink Apachecon PresentationFlink Apachecon Presentation
Flink Apachecon Presentation
Gyula Fóra
 
HadoopCon 2016 - 用 Jupyter Notebook Hold 住一個上線 Spark Machine Learning 專案實戰
HadoopCon 2016  - 用 Jupyter Notebook Hold 住一個上線 Spark  Machine Learning 專案實戰HadoopCon 2016  - 用 Jupyter Notebook Hold 住一個上線 Spark  Machine Learning 專案實戰
HadoopCon 2016 - 用 Jupyter Notebook Hold 住一個上線 Spark Machine Learning 專案實戰
Wayne Chen
 
HadoopCon'16, Taipei @myui
HadoopCon'16, Taipei @myuiHadoopCon'16, Taipei @myui
HadoopCon'16, Taipei @myui
Makoto Yui
 
Achieve big data analytic platform with lambda architecture on cloud
Achieve big data analytic platform with lambda architecture on cloudAchieve big data analytic platform with lambda architecture on cloud
Achieve big data analytic platform with lambda architecture on cloud
Scott Miao
 
SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)
wqchen
 
Hadoop con2016 - Implement Real-time Centralized logging System by Elastic Stack
Hadoop con2016 - Implement Real-time Centralized logging System by Elastic StackHadoop con2016 - Implement Real-time Centralized logging System by Elastic Stack
Hadoop con2016 - Implement Real-time Centralized logging System by Elastic Stack
Len Chang
 
Ad

Similar to Apache Flink Training Workshop @ HadoopCon2016 - #2 DataSet API Hands-On (20)

How to make data available for analytics ASAP
How to make data available for analytics ASAPHow to make data available for analytics ASAP
How to make data available for analytics ASAP
MariaDB plc
 
ML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talkML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talk
Faisal Siddiqi
 
Netflix Machine Learning Infra for Recommendations - 2018
Netflix Machine Learning Infra for Recommendations - 2018Netflix Machine Learning Infra for Recommendations - 2018
Netflix Machine Learning Infra for Recommendations - 2018
Karthik Murugesan
 
Semmle Codeql
Semmle Codeql Semmle Codeql
Semmle Codeql
M. S.
 
Introduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big DataIntroduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big Data
Jihoon Son
 
Data pipelines from zero to solid
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solid
Lars Albertsson
 
BUD17-302: LLVM Internals #2
BUD17-302: LLVM Internals #2 BUD17-302: LLVM Internals #2
BUD17-302: LLVM Internals #2
Linaro
 
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...
AMD Developer Central
 
Gearpump akka streams
Gearpump akka streamsGearpump akka streams
Gearpump akka streams
Kam Kasravi
 
Fast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteFast federated SQL with Apache Calcite
Fast federated SQL with Apache Calcite
Chris Baynes
 
M|18 Ingesting Data with the New Bulk Data Adapters
M|18 Ingesting Data with the New Bulk Data AdaptersM|18 Ingesting Data with the New Bulk Data Adapters
M|18 Ingesting Data with the New Bulk Data Adapters
MariaDB plc
 
Sorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifySorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at Spotify
Neville Li
 
Titan and Cassandra at WellAware
Titan and Cassandra at WellAwareTitan and Cassandra at WellAware
Titan and Cassandra at WellAware
twilmes
 
Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0
Lars Albertsson
 
from Binary to Binary: How Qemu Works
from Binary to Binary: How Qemu Worksfrom Binary to Binary: How Qemu Works
from Binary to Binary: How Qemu Works
Zhen Wei
 
Unifying Frontend and Backend Development with Scala - ScalaCon 2021
Unifying Frontend and Backend Development with Scala - ScalaCon 2021Unifying Frontend and Backend Development with Scala - ScalaCon 2021
Unifying Frontend and Backend Development with Scala - ScalaCon 2021
Taro L. Saito
 
What's New in MariaDB Server 10.3
What's New in MariaDB Server 10.3What's New in MariaDB Server 10.3
What's New in MariaDB Server 10.3
MariaDB plc
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
Databricks
 
Google V8 engine
Google V8 engineGoogle V8 engine
Google V8 engine
Xuân Thu Nguyễn
 
MariaDB Server 10.3 - Temporale Daten und neues zur DB-Kompatibilität
MariaDB Server 10.3 - Temporale Daten und neues zur DB-KompatibilitätMariaDB Server 10.3 - Temporale Daten und neues zur DB-Kompatibilität
MariaDB Server 10.3 - Temporale Daten und neues zur DB-Kompatibilität
MariaDB plc
 
How to make data available for analytics ASAP
How to make data available for analytics ASAPHow to make data available for analytics ASAP
How to make data available for analytics ASAP
MariaDB plc
 
ML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talkML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talk
Faisal Siddiqi
 
Netflix Machine Learning Infra for Recommendations - 2018
Netflix Machine Learning Infra for Recommendations - 2018Netflix Machine Learning Infra for Recommendations - 2018
Netflix Machine Learning Infra for Recommendations - 2018
Karthik Murugesan
 
Semmle Codeql
Semmle Codeql Semmle Codeql
Semmle Codeql
M. S.
 
Introduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big DataIntroduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big Data
Jihoon Son
 
Data pipelines from zero to solid
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solid
Lars Albertsson
 
BUD17-302: LLVM Internals #2
BUD17-302: LLVM Internals #2 BUD17-302: LLVM Internals #2
BUD17-302: LLVM Internals #2
Linaro
 
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...
AMD Developer Central
 
Gearpump akka streams
Gearpump akka streamsGearpump akka streams
Gearpump akka streams
Kam Kasravi
 
Fast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteFast federated SQL with Apache Calcite
Fast federated SQL with Apache Calcite
Chris Baynes
 
M|18 Ingesting Data with the New Bulk Data Adapters
M|18 Ingesting Data with the New Bulk Data AdaptersM|18 Ingesting Data with the New Bulk Data Adapters
M|18 Ingesting Data with the New Bulk Data Adapters
MariaDB plc
 
Sorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifySorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at Spotify
Neville Li
 
Titan and Cassandra at WellAware
Titan and Cassandra at WellAwareTitan and Cassandra at WellAware
Titan and Cassandra at WellAware
twilmes
 
Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0
Lars Albertsson
 
from Binary to Binary: How Qemu Works
from Binary to Binary: How Qemu Worksfrom Binary to Binary: How Qemu Works
from Binary to Binary: How Qemu Works
Zhen Wei
 
Unifying Frontend and Backend Development with Scala - ScalaCon 2021
Unifying Frontend and Backend Development with Scala - ScalaCon 2021Unifying Frontend and Backend Development with Scala - ScalaCon 2021
Unifying Frontend and Backend Development with Scala - ScalaCon 2021
Taro L. Saito
 
What's New in MariaDB Server 10.3
What's New in MariaDB Server 10.3What's New in MariaDB Server 10.3
What's New in MariaDB Server 10.3
MariaDB plc
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
Databricks
 
MariaDB Server 10.3 - Temporale Daten und neues zur DB-Kompatibilität
MariaDB Server 10.3 - Temporale Daten und neues zur DB-KompatibilitätMariaDB Server 10.3 - Temporale Daten und neues zur DB-Kompatibilität
MariaDB Server 10.3 - Temporale Daten und neues zur DB-Kompatibilität
MariaDB plc
 
Ad

Recently uploaded (20)

Adopting Process Mining at the Rabobank - use case
Adopting Process Mining at the Rabobank - use caseAdopting Process Mining at the Rabobank - use case
Adopting Process Mining at the Rabobank - use case
Process mining Evangelist
 
Understanding Complex Development Processes
Understanding Complex Development ProcessesUnderstanding Complex Development Processes
Understanding Complex Development Processes
Process mining Evangelist
 
50_questions_full.pptxdddddddddddddddddd
50_questions_full.pptxdddddddddddddddddd50_questions_full.pptxdddddddddddddddddd
50_questions_full.pptxdddddddddddddddddd
emir73065
 
What is ETL? Difference between ETL and ELT?.pdf
What is ETL? Difference between ETL and ELT?.pdfWhat is ETL? Difference between ETL and ELT?.pdf
What is ETL? Difference between ETL and ELT?.pdf
SaikatBasu37
 
AWS RDS Presentation to make concepts easy.pptx
AWS RDS Presentation to make concepts easy.pptxAWS RDS Presentation to make concepts easy.pptx
AWS RDS Presentation to make concepts easy.pptx
bharatkumarbhojwani
 
新西兰文凭奥克兰理工大学毕业证书AUT成绩单补办
新西兰文凭奥克兰理工大学毕业证书AUT成绩单补办新西兰文凭奥克兰理工大学毕业证书AUT成绩单补办
新西兰文凭奥克兰理工大学毕业证书AUT成绩单补办
Taqyea
 
2024-Media-Literacy-Index-Of-Ukrainians-ENG-SHORT.pdf
2024-Media-Literacy-Index-Of-Ukrainians-ENG-SHORT.pdf2024-Media-Literacy-Index-Of-Ukrainians-ENG-SHORT.pdf
2024-Media-Literacy-Index-Of-Ukrainians-ENG-SHORT.pdf
OlhaTatokhina1
 
RAG Chatbot using AWS Bedrock and Streamlit Framework
RAG Chatbot using AWS Bedrock and Streamlit FrameworkRAG Chatbot using AWS Bedrock and Streamlit Framework
RAG Chatbot using AWS Bedrock and Streamlit Framework
apanneer
 
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdfTOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
NhiV747372
 
Process Mining at Dimension Data - Jan vermeulen
Process Mining at Dimension Data - Jan vermeulenProcess Mining at Dimension Data - Jan vermeulen
Process Mining at Dimension Data - Jan vermeulen
Process mining Evangelist
 
L1_Slides_Foundational Concepts_508.pptx
L1_Slides_Foundational Concepts_508.pptxL1_Slides_Foundational Concepts_508.pptx
L1_Slides_Foundational Concepts_508.pptx
38NoopurPatel
 
HershAggregator (2).pdf musicretaildistribution
HershAggregator (2).pdf musicretaildistributionHershAggregator (2).pdf musicretaildistribution
HershAggregator (2).pdf musicretaildistribution
hershtara1
 
Analysis of Billboards hot 100 toop five hit makers on the chart.docx
Analysis of Billboards hot 100 toop five hit makers on the chart.docxAnalysis of Billboards hot 100 toop five hit makers on the chart.docx
Analysis of Billboards hot 100 toop five hit makers on the chart.docx
hershtara1
 
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
Taqyea
 
Voice Control robotic arm hggyghghgjgjhgjg
Voice Control robotic arm hggyghghgjgjhgjgVoice Control robotic arm hggyghghgjgjhgjg
Voice Control robotic arm hggyghghgjgjhgjg
4mg22ec401
 
AWS Certified Machine Learning Slides.pdf
AWS Certified Machine Learning Slides.pdfAWS Certified Machine Learning Slides.pdf
AWS Certified Machine Learning Slides.pdf
philsparkshome
 
Process Mining as Enabler for Digital Transformations
Process Mining as Enabler for Digital TransformationsProcess Mining as Enabler for Digital Transformations
Process Mining as Enabler for Digital Transformations
Process mining Evangelist
 
Transforming health care with ai powered
Transforming health care with ai poweredTransforming health care with ai powered
Transforming health care with ai powered
gowthamarvj
 
Z14_IBM__APL_by_Christian_Demmer_IBM.pdf
Z14_IBM__APL_by_Christian_Demmer_IBM.pdfZ14_IBM__APL_by_Christian_Demmer_IBM.pdf
Z14_IBM__APL_by_Christian_Demmer_IBM.pdf
Fariborz Seyedloo
 
Feature Engineering for Electronic Health Record Systems
Feature Engineering for Electronic Health Record SystemsFeature Engineering for Electronic Health Record Systems
Feature Engineering for Electronic Health Record Systems
Process mining Evangelist
 
Adopting Process Mining at the Rabobank - use case
Adopting Process Mining at the Rabobank - use caseAdopting Process Mining at the Rabobank - use case
Adopting Process Mining at the Rabobank - use case
Process mining Evangelist
 
50_questions_full.pptxdddddddddddddddddd
50_questions_full.pptxdddddddddddddddddd50_questions_full.pptxdddddddddddddddddd
50_questions_full.pptxdddddddddddddddddd
emir73065
 
What is ETL? Difference between ETL and ELT?.pdf
What is ETL? Difference between ETL and ELT?.pdfWhat is ETL? Difference between ETL and ELT?.pdf
What is ETL? Difference between ETL and ELT?.pdf
SaikatBasu37
 
AWS RDS Presentation to make concepts easy.pptx
AWS RDS Presentation to make concepts easy.pptxAWS RDS Presentation to make concepts easy.pptx
AWS RDS Presentation to make concepts easy.pptx
bharatkumarbhojwani
 
新西兰文凭奥克兰理工大学毕业证书AUT成绩单补办
新西兰文凭奥克兰理工大学毕业证书AUT成绩单补办新西兰文凭奥克兰理工大学毕业证书AUT成绩单补办
新西兰文凭奥克兰理工大学毕业证书AUT成绩单补办
Taqyea
 
2024-Media-Literacy-Index-Of-Ukrainians-ENG-SHORT.pdf
2024-Media-Literacy-Index-Of-Ukrainians-ENG-SHORT.pdf2024-Media-Literacy-Index-Of-Ukrainians-ENG-SHORT.pdf
2024-Media-Literacy-Index-Of-Ukrainians-ENG-SHORT.pdf
OlhaTatokhina1
 
RAG Chatbot using AWS Bedrock and Streamlit Framework
RAG Chatbot using AWS Bedrock and Streamlit FrameworkRAG Chatbot using AWS Bedrock and Streamlit Framework
RAG Chatbot using AWS Bedrock and Streamlit Framework
apanneer
 
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdfTOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
NhiV747372
 
Process Mining at Dimension Data - Jan vermeulen
Process Mining at Dimension Data - Jan vermeulenProcess Mining at Dimension Data - Jan vermeulen
Process Mining at Dimension Data - Jan vermeulen
Process mining Evangelist
 
L1_Slides_Foundational Concepts_508.pptx
L1_Slides_Foundational Concepts_508.pptxL1_Slides_Foundational Concepts_508.pptx
L1_Slides_Foundational Concepts_508.pptx
38NoopurPatel
 
HershAggregator (2).pdf musicretaildistribution
HershAggregator (2).pdf musicretaildistributionHershAggregator (2).pdf musicretaildistribution
HershAggregator (2).pdf musicretaildistribution
hershtara1
 
Analysis of Billboards hot 100 toop five hit makers on the chart.docx
Analysis of Billboards hot 100 toop five hit makers on the chart.docxAnalysis of Billboards hot 100 toop five hit makers on the chart.docx
Analysis of Billboards hot 100 toop five hit makers on the chart.docx
hershtara1
 
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
Taqyea
 
Voice Control robotic arm hggyghghgjgjhgjg
Voice Control robotic arm hggyghghgjgjhgjgVoice Control robotic arm hggyghghgjgjhgjg
Voice Control robotic arm hggyghghgjgjhgjg
4mg22ec401
 
AWS Certified Machine Learning Slides.pdf
AWS Certified Machine Learning Slides.pdfAWS Certified Machine Learning Slides.pdf
AWS Certified Machine Learning Slides.pdf
philsparkshome
 
Process Mining as Enabler for Digital Transformations
Process Mining as Enabler for Digital TransformationsProcess Mining as Enabler for Digital Transformations
Process Mining as Enabler for Digital Transformations
Process mining Evangelist
 
Transforming health care with ai powered
Transforming health care with ai poweredTransforming health care with ai powered
Transforming health care with ai powered
gowthamarvj
 
Z14_IBM__APL_by_Christian_Demmer_IBM.pdf
Z14_IBM__APL_by_Christian_Demmer_IBM.pdfZ14_IBM__APL_by_Christian_Demmer_IBM.pdf
Z14_IBM__APL_by_Christian_Demmer_IBM.pdf
Fariborz Seyedloo
 
Feature Engineering for Electronic Health Record Systems
Feature Engineering for Electronic Health Record SystemsFeature Engineering for Electronic Health Record Systems
Feature Engineering for Electronic Health Record Systems
Process mining Evangelist
 

Apache Flink Training Workshop @ HadoopCon2016 - #2 DataSet API Hands-On

  • 2. Who Am I? ● Newbie in Apache Flink ● BlueWell Technology ○ Big Data Architect ○ Focuses ■ Open DC/OS ■ CoreOS ■ Kubernetes ■ Apache Flink ■ Data Science
  • 3. Agenda ● Apache Flink Type System ○ Atomic ○ Composite ○ Tuple ○ POJO ○ Scala case class ● Transformations ○ Transformations on DataSet ○ Rich Functions ○ Accumulators & Counters ○ Annotations
  • 4. Agenda ● Iterations ○ Bulk iterations ○ Delta iterations
  • 5. Basic Structure Of Apache Flink Programs ● For each Apache Flink Program, the basic structure is listed as follows. ○ Obtain an execution environment. ■ ExecutionEnvironment.getExecutionEnvironment() ○ Load/create DataSets from data sources. ■ readFile(), readTextFile(), readCsvFile(), etc. ■ fromElements(), fromCollection(), etc. ○ Execute some transformations on the DataSets. ■ filter(), map(), reduce(), etc. ○ Specify where to save results of the computations. ■ write(), writeAsCsv(), etc. ■ collect(), print(), etc. ○ Trigger the program execution. Hands-on BasicStructure
  • 6. Apache Flink Type System ● Flink attempts to support all data types ○ Facilitate programming ○ Seamlessly integrate with legacy code ● Flink analyzes applications before execution for ○ Identifying used data types ○ Determining serializers and comparators ● Data types could be ○ Atomic data types ○ Composite data types
  • 7. Composite Data Types ● Composite Data Types include ○ Tuples ■ In Java ■ In Scala ○ POJOs ○ Scala case class
  • 8. Tuple Data Types ● Flink supports Tuple in ○ Java: org.apache.flink.api.java.tuple.Tuple<n> ■ n = 1, …, 25 ○ Scala: premitive Tuple<n> ■ n = 1, …, 22 ● Key declarations ○ Field index ○ E.g., dataset.groupBy(0).sum(1) ○ E.g., dataset.groupBy(“_1”).sum(“_2”)
  • 9. POJOs ● POJOs – A java class with ○ A default constructor without parameters. ○ All fields are ■ public or ■ private but have getter & setter ■ Ex. public class Car { public int id; public String brand; public Car() {}; public Car(int id, String brand) {…}; }
  • 10. POJOs ● Key Declarations ○ Field name as a string ○ E.g., cars.groupBy(“brand”)
  • 11. Scala Case Class ● Primitive Scala case classes are supported ○ E.g., case class Car(id: Int, brand: String) ● Key declarations ○ Field name as a string ■ E.g., cars.groupBy(“brand”) ○ Field name ■ E.g., cars.groupBy(_.brand) Hands-on TypeSystem
  • 13. Introduction To Rich Functions ● Purpose ○ Implement complicated user-defined functions ○ Broadcast variables ○ Access to Accumulators & Counters ● Structure ○ open(): initialization, one-shot setup work. ○ close(): tear-down, clean-up work. ○ get/setRuntimeContext(): access to RuntimeContext. ○ corresponding transformation method, e.g., map, join, etc.
  • 14. Broadcast Variables ● Register ○ dataset.map(new RichMapFunction()) .withBroadcastSet(toBroadcast, “varName”) ● Access in Rich Functions ○ Initialize the broadcasted variables in open() via ○ getRuntimeContext().getBroadcastVariable(“varName”) ○ Access them in the whole class scope.
  • 15. Accumulators & Counters ● Purpose ○ Debugging ○ First glance of DataSets ● Counters are kinds of accumulator ● Structure ○ An add operation ○ A final accumulated result (available after the job ended) ● Flink will automatically sum up all partial results.
  • 16. Accumulators & Counters ● Built-in Accumulators ○ IntCounter, LongCounter, DoubleCounter ○ Histogram: map from integer to integer, distributions ● Register ○ new IntCounter() ○ getRuntimeContext().addAccumulator(“accuName”, counter) ● Access ○ In Rich Functions ■ getRuntimeContext().getAccumulator(“accuName”) ○ In the end of job ■ JobExecutionResult.getAccumulatorResult(“accuName")
  • 17. Semantic Annotations ● Give Flink hints about the behavior of a function ○ A powerful means to speed up execution ○ Reusing sort orders or partitions across multiple operations ○ Prevent programs from unnecessary data shuffling or unnecessary sorts ● Types of Annotation ○ Forwarded fields annotations (@ForwardedFields) ○ Non-forwarded fields annotations (@NonForwardedFields) ■ Black or White in place ○ Read fields annotations (@ReadFields) ○ Fields to be read and evaluated
  • 18. Iterations ● Bulk iterations ○ Partial Solution ○ Iteration Result
  • 19. Iterations ● Delta Iterations ○ Workset / Update Solution Set ○ Iteration Result
  翻译: