SlideShare a Scribd company logo
Basic
Using Spark DataFrame
For SQL
charsyam@naver.com
Create DataFrame From File
val path = “abc.txt”
val df = spark.read.text(path)
Create DataFrame From Kafka
val rdd = KafkaUtils.createRDD[String, String](...)
val logsDF = rdd.map { _.value }.toDF
Spark DataFrame Column
1) col("column name")
2) $"column name"
1) And 2) are the same.
Simple Iris TSV Logs
http://www.math.uah.edu/stat/data/Fisher.txt
Type PW PL SW SL
0 2 14 33 50
1 24 56 31 67
1 23 51 31 69
0 2 10 36 46
1 20 52 30 65
1 19 51 27 58
Load TSV with StructType
import org.apache.spark.sql.types._
var irisSchema = StructType(Array(
StructField("Type", IntegerType, true),
StructField("PetalWidth", IntegerType, true),
StructField("PetalLength", IntegerType, true),
StructField("SepalWidth", IntegerType, true),
StructField("SepalLength", IntegerType, true)
))
Load TSV with Encoder #1
import org.apache.spark.sql.Encoders
case class IrisSchema(Type: Int, PetalWidth: Int, PetalLength: Int,
SepalWidth: Int, SepalLength: Int)
var irisSchema = Encoders.product[IrisSchema].schema
Load TSV
var irisDf = spark.read.format("csv"). // Use "csv" regardless of TSV or CSV.
option("header", "true"). // Does the file have a header line?
option("delimiter", "t"). // Set delimiter to tab or comma.
schema(irisSchema). // Schema that was built above.
load("Fisher.txt")
irisDf.show(5)
Load TSV - Show Results
scala> irisDf.show(5)
+----+----------+-----------+----------+-----------+
|Type|PetalWidth|PetalLength|SepalWidth|SepalLength|
+----+----------+-----------+----------+-----------+
| 0| 2| 14| 33| 50|
| 1| 24| 56| 31| 67|
| 1| 23| 51| 31| 69|
| 0| 2| 10| 36| 46|
| 1| 20| 52| 30| 65|
+----+----------+-----------+----------+-----------+
only showing top 5 rows
Using sqlContext sql
Super easy way
val view = df.createOrReplaceTempView("tmp_iris")
val resultDF = df.sqlContext.sql("select type, PetalWidth from tmp_iris")
Simple Select
SQL:
Select type, petalwidth + sepalwidth as sum_width from …
val sumDF = df.withColumn("sum_width", col("PetalWidth") + col("SepalWidth"))
val resultDF = sumDF.selectExpr("Type", "sum_width")
val resultDF = sumDF.selectExpr("*") ← select *
Select with where
SQL:
Select type, petalwidth from … where petalwidth > 10
val whereDF = df.filter($"petalwidth" > 10)
val whereDF = df.where($"petalwidth" > 10)
//filter and where are the same
val resultDF = whereDF.selectExpr("Type", "petalwidth")
Select with order by
SQL:
Select petalwidth, sepalwidth from … order by petalwidth, sepalwidth desc
1) val sortDF = df.sort($"petalwidth", $"sepalwidth".desc)
2) val sortDF = df.sort($"petalwidth", desc("sepalwidth"))
3) val sortDF = df.orderBy($"petalwidth", desc("sepalwidth"))
1), 2) And 3) are the same.
val resultDF = sortDF.selectExpr("petalwidth", "sepalwidth")
Select with Group by
SQL:
Select type, max(petalwidth) A, min(sepalwidth) B from … group by type
val groupDF = df.groupBy($"type").agg(max($"petalwidth").as("A"),
min($"sepalwidth").as("B"))
val resultDF = groupDF.selectExpr("type", "A", "B")
Tip - Support MapType<String, String> like Hive
SQL in Hive:
Create table test (type map<string, string>);
Hive support str_to_map, but spark not support for dataframe(spark support
str_to_map for hiveQL).
Using udf to solve this.
val string_line = "A=1,B=2,C=3"
Val df = logsDF.withColumn("type", str_to_map(string_line))
UDF - str_to_map
val str_to_map = udf {
text : String =>
val pairs = text.split("delimiter1|delimiter2").grouped(2)
pairs.map { case Array(k, v) => k -> v}.toMap
}
Thank you.
Ad

More Related Content

What's hot (20)

Format xls sheets Demo Mode
Format xls sheets Demo ModeFormat xls sheets Demo Mode
Format xls sheets Demo Mode
Jared Bourne
 
The Ring programming language version 1.6 book - Part 32 of 189
The Ring programming language version 1.6 book - Part 32 of 189The Ring programming language version 1.6 book - Part 32 of 189
The Ring programming language version 1.6 book - Part 32 of 189
Mahmoud Samir Fayed
 
The Ring programming language version 1.2 book - Part 19 of 84
The Ring programming language version 1.2 book - Part 19 of 84The Ring programming language version 1.2 book - Part 19 of 84
The Ring programming language version 1.2 book - Part 19 of 84
Mahmoud Samir Fayed
 
SICP_2.5 일반화된 연산시스템
SICP_2.5 일반화된 연산시스템SICP_2.5 일반화된 연산시스템
SICP_2.5 일반화된 연산시스템
HyeonSeok Choi
 
The Ring programming language version 1.10 book - Part 47 of 212
The Ring programming language version 1.10 book - Part 47 of 212The Ring programming language version 1.10 book - Part 47 of 212
The Ring programming language version 1.10 book - Part 47 of 212
Mahmoud Samir Fayed
 
The Ring programming language version 1.4.1 book - Part 13 of 31
The Ring programming language version 1.4.1 book - Part 13 of 31The Ring programming language version 1.4.1 book - Part 13 of 31
The Ring programming language version 1.4.1 book - Part 13 of 31
Mahmoud Samir Fayed
 
JSON Support in MariaDB: News, non-news and the bigger picture
JSON Support in MariaDB: News, non-news and the bigger pictureJSON Support in MariaDB: News, non-news and the bigger picture
JSON Support in MariaDB: News, non-news and the bigger picture
Sergey Petrunya
 
Rule Your Geometry with the Terraformer Toolkit
Rule Your Geometry with the Terraformer ToolkitRule Your Geometry with the Terraformer Toolkit
Rule Your Geometry with the Terraformer Toolkit
Aaron Parecki
 
Get docs from sp doc library
Get docs from sp doc libraryGet docs from sp doc library
Get docs from sp doc library
Sudip Sengupta
 
GreenDao Introduction
GreenDao IntroductionGreenDao Introduction
GreenDao Introduction
Booch Lin
 
The Ring programming language version 1.7 book - Part 41 of 196
The Ring programming language version 1.7 book - Part 41 of 196The Ring programming language version 1.7 book - Part 41 of 196
The Ring programming language version 1.7 book - Part 41 of 196
Mahmoud Samir Fayed
 
Memory management
Memory managementMemory management
Memory management
Kuban Dzhakipov
 
The Ring programming language version 1.7 book - Part 48 of 196
The Ring programming language version 1.7 book - Part 48 of 196The Ring programming language version 1.7 book - Part 48 of 196
The Ring programming language version 1.7 book - Part 48 of 196
Mahmoud Samir Fayed
 
Node js mongodriver
Node js mongodriverNode js mongodriver
Node js mongodriver
christkv
 
The Ring programming language version 1.5.3 book - Part 30 of 184
The Ring programming language version 1.5.3 book - Part 30 of 184The Ring programming language version 1.5.3 book - Part 30 of 184
The Ring programming language version 1.5.3 book - Part 30 of 184
Mahmoud Samir Fayed
 
The Ring programming language version 1.9 book - Part 46 of 210
The Ring programming language version 1.9 book - Part 46 of 210The Ring programming language version 1.9 book - Part 46 of 210
The Ring programming language version 1.9 book - Part 46 of 210
Mahmoud Samir Fayed
 
Slick: Bringing Scala’s Powerful Features to Your Database Access
Slick: Bringing Scala’s Powerful Features to Your Database Access Slick: Bringing Scala’s Powerful Features to Your Database Access
Slick: Bringing Scala’s Powerful Features to Your Database Access
Rebecca Grenier
 
The Ring programming language version 1.5 book - Part 8 of 31
The Ring programming language version 1.5 book - Part 8 of 31The Ring programming language version 1.5 book - Part 8 of 31
The Ring programming language version 1.5 book - Part 8 of 31
Mahmoud Samir Fayed
 
The Ring programming language version 1.5.3 book - Part 37 of 184
The Ring programming language version 1.5.3 book - Part 37 of 184The Ring programming language version 1.5.3 book - Part 37 of 184
The Ring programming language version 1.5.3 book - Part 37 of 184
Mahmoud Samir Fayed
 
Odoo Technical Concepts Summary
Odoo Technical Concepts SummaryOdoo Technical Concepts Summary
Odoo Technical Concepts Summary
Mohamed Magdy
 
Format xls sheets Demo Mode
Format xls sheets Demo ModeFormat xls sheets Demo Mode
Format xls sheets Demo Mode
Jared Bourne
 
The Ring programming language version 1.6 book - Part 32 of 189
The Ring programming language version 1.6 book - Part 32 of 189The Ring programming language version 1.6 book - Part 32 of 189
The Ring programming language version 1.6 book - Part 32 of 189
Mahmoud Samir Fayed
 
The Ring programming language version 1.2 book - Part 19 of 84
The Ring programming language version 1.2 book - Part 19 of 84The Ring programming language version 1.2 book - Part 19 of 84
The Ring programming language version 1.2 book - Part 19 of 84
Mahmoud Samir Fayed
 
SICP_2.5 일반화된 연산시스템
SICP_2.5 일반화된 연산시스템SICP_2.5 일반화된 연산시스템
SICP_2.5 일반화된 연산시스템
HyeonSeok Choi
 
The Ring programming language version 1.10 book - Part 47 of 212
The Ring programming language version 1.10 book - Part 47 of 212The Ring programming language version 1.10 book - Part 47 of 212
The Ring programming language version 1.10 book - Part 47 of 212
Mahmoud Samir Fayed
 
The Ring programming language version 1.4.1 book - Part 13 of 31
The Ring programming language version 1.4.1 book - Part 13 of 31The Ring programming language version 1.4.1 book - Part 13 of 31
The Ring programming language version 1.4.1 book - Part 13 of 31
Mahmoud Samir Fayed
 
JSON Support in MariaDB: News, non-news and the bigger picture
JSON Support in MariaDB: News, non-news and the bigger pictureJSON Support in MariaDB: News, non-news and the bigger picture
JSON Support in MariaDB: News, non-news and the bigger picture
Sergey Petrunya
 
Rule Your Geometry with the Terraformer Toolkit
Rule Your Geometry with the Terraformer ToolkitRule Your Geometry with the Terraformer Toolkit
Rule Your Geometry with the Terraformer Toolkit
Aaron Parecki
 
Get docs from sp doc library
Get docs from sp doc libraryGet docs from sp doc library
Get docs from sp doc library
Sudip Sengupta
 
GreenDao Introduction
GreenDao IntroductionGreenDao Introduction
GreenDao Introduction
Booch Lin
 
The Ring programming language version 1.7 book - Part 41 of 196
The Ring programming language version 1.7 book - Part 41 of 196The Ring programming language version 1.7 book - Part 41 of 196
The Ring programming language version 1.7 book - Part 41 of 196
Mahmoud Samir Fayed
 
The Ring programming language version 1.7 book - Part 48 of 196
The Ring programming language version 1.7 book - Part 48 of 196The Ring programming language version 1.7 book - Part 48 of 196
The Ring programming language version 1.7 book - Part 48 of 196
Mahmoud Samir Fayed
 
Node js mongodriver
Node js mongodriverNode js mongodriver
Node js mongodriver
christkv
 
The Ring programming language version 1.5.3 book - Part 30 of 184
The Ring programming language version 1.5.3 book - Part 30 of 184The Ring programming language version 1.5.3 book - Part 30 of 184
The Ring programming language version 1.5.3 book - Part 30 of 184
Mahmoud Samir Fayed
 
The Ring programming language version 1.9 book - Part 46 of 210
The Ring programming language version 1.9 book - Part 46 of 210The Ring programming language version 1.9 book - Part 46 of 210
The Ring programming language version 1.9 book - Part 46 of 210
Mahmoud Samir Fayed
 
Slick: Bringing Scala’s Powerful Features to Your Database Access
Slick: Bringing Scala’s Powerful Features to Your Database Access Slick: Bringing Scala’s Powerful Features to Your Database Access
Slick: Bringing Scala’s Powerful Features to Your Database Access
Rebecca Grenier
 
The Ring programming language version 1.5 book - Part 8 of 31
The Ring programming language version 1.5 book - Part 8 of 31The Ring programming language version 1.5 book - Part 8 of 31
The Ring programming language version 1.5 book - Part 8 of 31
Mahmoud Samir Fayed
 
The Ring programming language version 1.5.3 book - Part 37 of 184
The Ring programming language version 1.5.3 book - Part 37 of 184The Ring programming language version 1.5.3 book - Part 37 of 184
The Ring programming language version 1.5.3 book - Part 37 of 184
Mahmoud Samir Fayed
 
Odoo Technical Concepts Summary
Odoo Technical Concepts SummaryOdoo Technical Concepts Summary
Odoo Technical Concepts Summary
Mohamed Magdy
 

Similar to Using spark data frame for sql (20)

Solr As A SparkSQL DataSource
Solr As A SparkSQL DataSourceSolr As A SparkSQL DataSource
Solr As A SparkSQL DataSource
Spark Summit
 
ScalikeJDBC Tutorial for Beginners
ScalikeJDBC Tutorial for BeginnersScalikeJDBC Tutorial for Beginners
ScalikeJDBC Tutorial for Beginners
Kazuhiro Sera
 
Using Spark to Load Oracle Data into Cassandra (Jim Hatcher, IHS Markit) | C*...
Using Spark to Load Oracle Data into Cassandra (Jim Hatcher, IHS Markit) | C*...Using Spark to Load Oracle Data into Cassandra (Jim Hatcher, IHS Markit) | C*...
Using Spark to Load Oracle Data into Cassandra (Jim Hatcher, IHS Markit) | C*...
DataStax
 
Using Spark to Load Oracle Data into Cassandra
Using Spark to Load Oracle Data into CassandraUsing Spark to Load Oracle Data into Cassandra
Using Spark to Load Oracle Data into Cassandra
Jim Hatcher
 
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIsBig Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
Matt Stubbs
 
Introduction to Spark with Scala
Introduction to Spark with ScalaIntroduction to Spark with Scala
Introduction to Spark with Scala
Himanshu Gupta
 
Meetup spark structured streaming
Meetup spark structured streamingMeetup spark structured streaming
Meetup spark structured streaming
José Carlos García Serrano
 
[QE 2018] Łukasz Gawron – Testing Batch and Streaming Spark Applications
[QE 2018] Łukasz Gawron – Testing Batch and Streaming Spark Applications[QE 2018] Łukasz Gawron – Testing Batch and Streaming Spark Applications
[QE 2018] Łukasz Gawron – Testing Batch and Streaming Spark Applications
Future Processing
 
Testing batch and streaming Spark applications
Testing batch and streaming Spark applicationsTesting batch and streaming Spark applications
Testing batch and streaming Spark applications
Łukasz Gawron
 
3 database-jdbc(1)
3 database-jdbc(1)3 database-jdbc(1)
3 database-jdbc(1)
hameedkhan2017
 
Spark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted Malaska
Spark Summit
 
User Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryUser Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love Story
Databricks
 
User Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryUser Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love Story
Databricks
 
Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013
Samir Bessalah
 
Spark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Summit
 
The Ring programming language version 1.5.4 book - Part 37 of 185
The Ring programming language version 1.5.4 book - Part 37 of 185The Ring programming language version 1.5.4 book - Part 37 of 185
The Ring programming language version 1.5.4 book - Part 37 of 185
Mahmoud Samir Fayed
 
SparkSQLの構文解析
SparkSQLの構文解析SparkSQLの構文解析
SparkSQLの構文解析
ゆり 井上
 
The Ring programming language version 1.5.3 book - Part 54 of 184
The Ring programming language version 1.5.3 book - Part 54 of 184The Ring programming language version 1.5.3 book - Part 54 of 184
The Ring programming language version 1.5.3 book - Part 54 of 184
Mahmoud Samir Fayed
 
The Ring programming language version 1.5.3 book - Part 44 of 184
The Ring programming language version 1.5.3 book - Part 44 of 184The Ring programming language version 1.5.3 book - Part 44 of 184
The Ring programming language version 1.5.3 book - Part 44 of 184
Mahmoud Samir Fayed
 
Scala in Places API
Scala in Places APIScala in Places API
Scala in Places API
Łukasz Bałamut
 
Solr As A SparkSQL DataSource
Solr As A SparkSQL DataSourceSolr As A SparkSQL DataSource
Solr As A SparkSQL DataSource
Spark Summit
 
ScalikeJDBC Tutorial for Beginners
ScalikeJDBC Tutorial for BeginnersScalikeJDBC Tutorial for Beginners
ScalikeJDBC Tutorial for Beginners
Kazuhiro Sera
 
Using Spark to Load Oracle Data into Cassandra (Jim Hatcher, IHS Markit) | C*...
Using Spark to Load Oracle Data into Cassandra (Jim Hatcher, IHS Markit) | C*...Using Spark to Load Oracle Data into Cassandra (Jim Hatcher, IHS Markit) | C*...
Using Spark to Load Oracle Data into Cassandra (Jim Hatcher, IHS Markit) | C*...
DataStax
 
Using Spark to Load Oracle Data into Cassandra
Using Spark to Load Oracle Data into CassandraUsing Spark to Load Oracle Data into Cassandra
Using Spark to Load Oracle Data into Cassandra
Jim Hatcher
 
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIsBig Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
Big Data LDN 2017: Processing Fast Data With Apache Spark: the Tale of Two APIs
Matt Stubbs
 
Introduction to Spark with Scala
Introduction to Spark with ScalaIntroduction to Spark with Scala
Introduction to Spark with Scala
Himanshu Gupta
 
[QE 2018] Łukasz Gawron – Testing Batch and Streaming Spark Applications
[QE 2018] Łukasz Gawron – Testing Batch and Streaming Spark Applications[QE 2018] Łukasz Gawron – Testing Batch and Streaming Spark Applications
[QE 2018] Łukasz Gawron – Testing Batch and Streaming Spark Applications
Future Processing
 
Testing batch and streaming Spark applications
Testing batch and streaming Spark applicationsTesting batch and streaming Spark applications
Testing batch and streaming Spark applications
Łukasz Gawron
 
Spark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted Malaska
Spark Summit
 
User Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryUser Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love Story
Databricks
 
User Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryUser Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love Story
Databricks
 
Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013
Samir Bessalah
 
Spark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Summit
 
The Ring programming language version 1.5.4 book - Part 37 of 185
The Ring programming language version 1.5.4 book - Part 37 of 185The Ring programming language version 1.5.4 book - Part 37 of 185
The Ring programming language version 1.5.4 book - Part 37 of 185
Mahmoud Samir Fayed
 
SparkSQLの構文解析
SparkSQLの構文解析SparkSQLの構文解析
SparkSQLの構文解析
ゆり 井上
 
The Ring programming language version 1.5.3 book - Part 54 of 184
The Ring programming language version 1.5.3 book - Part 54 of 184The Ring programming language version 1.5.3 book - Part 54 of 184
The Ring programming language version 1.5.3 book - Part 54 of 184
Mahmoud Samir Fayed
 
The Ring programming language version 1.5.3 book - Part 44 of 184
The Ring programming language version 1.5.3 book - Part 44 of 184The Ring programming language version 1.5.3 book - Part 44 of 184
The Ring programming language version 1.5.3 book - Part 44 of 184
Mahmoud Samir Fayed
 
Ad

More from DaeMyung Kang (20)

Count min sketch
Count min sketchCount min sketch
Count min sketch
DaeMyung Kang
 
Redis
RedisRedis
Redis
DaeMyung Kang
 
Ansible
AnsibleAnsible
Ansible
DaeMyung Kang
 
Why GUID is needed
Why GUID is neededWhy GUID is needed
Why GUID is needed
DaeMyung Kang
 
How to use redis well
How to use redis wellHow to use redis well
How to use redis well
DaeMyung Kang
 
The easiest consistent hashing
The easiest consistent hashingThe easiest consistent hashing
The easiest consistent hashing
DaeMyung Kang
 
How to name a cache key
How to name a cache keyHow to name a cache key
How to name a cache key
DaeMyung Kang
 
Integration between Filebeat and logstash
Integration between Filebeat and logstash Integration between Filebeat and logstash
Integration between Filebeat and logstash
DaeMyung Kang
 
How to build massive service for advance
How to build massive service for advanceHow to build massive service for advance
How to build massive service for advance
DaeMyung Kang
 
Massive service basic
Massive service basicMassive service basic
Massive service basic
DaeMyung Kang
 
Data Engineering 101
Data Engineering 101Data Engineering 101
Data Engineering 101
DaeMyung Kang
 
How To Become Better Engineer
How To Become Better EngineerHow To Become Better Engineer
How To Become Better Engineer
DaeMyung Kang
 
Kafka timestamp offset_final
Kafka timestamp offset_finalKafka timestamp offset_final
Kafka timestamp offset_final
DaeMyung Kang
 
Kafka timestamp offset
Kafka timestamp offsetKafka timestamp offset
Kafka timestamp offset
DaeMyung Kang
 
Data pipeline and data lake
Data pipeline and data lakeData pipeline and data lake
Data pipeline and data lake
DaeMyung Kang
 
Redis acl
Redis aclRedis acl
Redis acl
DaeMyung Kang
 
Coffee store
Coffee storeCoffee store
Coffee store
DaeMyung Kang
 
Scalable webservice
Scalable webserviceScalable webservice
Scalable webservice
DaeMyung Kang
 
Number system
Number systemNumber system
Number system
DaeMyung Kang
 
webservice scaling for newbie
webservice scaling for newbiewebservice scaling for newbie
webservice scaling for newbie
DaeMyung Kang
 
How to use redis well
How to use redis wellHow to use redis well
How to use redis well
DaeMyung Kang
 
The easiest consistent hashing
The easiest consistent hashingThe easiest consistent hashing
The easiest consistent hashing
DaeMyung Kang
 
How to name a cache key
How to name a cache keyHow to name a cache key
How to name a cache key
DaeMyung Kang
 
Integration between Filebeat and logstash
Integration between Filebeat and logstash Integration between Filebeat and logstash
Integration between Filebeat and logstash
DaeMyung Kang
 
How to build massive service for advance
How to build massive service for advanceHow to build massive service for advance
How to build massive service for advance
DaeMyung Kang
 
Massive service basic
Massive service basicMassive service basic
Massive service basic
DaeMyung Kang
 
Data Engineering 101
Data Engineering 101Data Engineering 101
Data Engineering 101
DaeMyung Kang
 
How To Become Better Engineer
How To Become Better EngineerHow To Become Better Engineer
How To Become Better Engineer
DaeMyung Kang
 
Kafka timestamp offset_final
Kafka timestamp offset_finalKafka timestamp offset_final
Kafka timestamp offset_final
DaeMyung Kang
 
Kafka timestamp offset
Kafka timestamp offsetKafka timestamp offset
Kafka timestamp offset
DaeMyung Kang
 
Data pipeline and data lake
Data pipeline and data lakeData pipeline and data lake
Data pipeline and data lake
DaeMyung Kang
 
webservice scaling for newbie
webservice scaling for newbiewebservice scaling for newbie
webservice scaling for newbie
DaeMyung Kang
 
Ad

Recently uploaded (20)

Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Christian Folini
 
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
Lorenzo Miniero
 
Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...
Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...
Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...
Maarten Verwaest
 
Digital Technologies for Culture, Arts and Heritage: Insights from Interdisci...
Digital Technologies for Culture, Arts and Heritage: Insights from Interdisci...Digital Technologies for Culture, Arts and Heritage: Insights from Interdisci...
Digital Technologies for Culture, Arts and Heritage: Insights from Interdisci...
Vasileios Komianos
 
Kit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdf
Kit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdfKit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdf
Kit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdf
Wonjun Hwang
 
論文紹介:"InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning" ...
論文紹介:"InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning" ...論文紹介:"InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning" ...
論文紹介:"InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning" ...
Toru Tamaki
 
Refactoring meta-rauc-community: Cleaner Code, Better Maintenance, More Machines
Refactoring meta-rauc-community: Cleaner Code, Better Maintenance, More MachinesRefactoring meta-rauc-community: Cleaner Code, Better Maintenance, More Machines
Refactoring meta-rauc-community: Cleaner Code, Better Maintenance, More Machines
Leon Anavi
 
AI-proof your career by Olivier Vroom and David WIlliamson
AI-proof your career by Olivier Vroom and David WIlliamsonAI-proof your career by Olivier Vroom and David WIlliamson
AI-proof your career by Olivier Vroom and David WIlliamson
UXPA Boston
 
Cybersecurity Tools and Technologies - Microsoft Certificate
Cybersecurity Tools and Technologies - Microsoft CertificateCybersecurity Tools and Technologies - Microsoft Certificate
Cybersecurity Tools and Technologies - Microsoft Certificate
VICTOR MAESTRE RAMIREZ
 
Slack like a pro: strategies for 10x engineering teams
Slack like a pro: strategies for 10x engineering teamsSlack like a pro: strategies for 10x engineering teams
Slack like a pro: strategies for 10x engineering teams
Nacho Cougil
 
Mastering Testing in the Modern F&B Landscape
Mastering Testing in the Modern F&B LandscapeMastering Testing in the Modern F&B Landscape
Mastering Testing in the Modern F&B Landscape
marketing943205
 
Agentic Automation - Delhi UiPath Community Meetup
Agentic Automation - Delhi UiPath Community MeetupAgentic Automation - Delhi UiPath Community Meetup
Agentic Automation - Delhi UiPath Community Meetup
Manoj Batra (1600 + Connections)
 
UiPath AgentHack - Build the AI agents of tomorrow_Enablement 1.pptx
UiPath AgentHack - Build the AI agents of tomorrow_Enablement 1.pptxUiPath AgentHack - Build the AI agents of tomorrow_Enablement 1.pptx
UiPath AgentHack - Build the AI agents of tomorrow_Enablement 1.pptx
anabulhac
 
Who's choice? Making decisions with and about Artificial Intelligence, Keele ...
Who's choice? Making decisions with and about Artificial Intelligence, Keele ...Who's choice? Making decisions with and about Artificial Intelligence, Keele ...
Who's choice? Making decisions with and about Artificial Intelligence, Keele ...
Alan Dix
 
Harmonizing Multi-Agent Intelligence | Open Data Science Conference | Gary Ar...
Harmonizing Multi-Agent Intelligence | Open Data Science Conference | Gary Ar...Harmonizing Multi-Agent Intelligence | Open Data Science Conference | Gary Ar...
Harmonizing Multi-Agent Intelligence | Open Data Science Conference | Gary Ar...
Gary Arora
 
OpenAI Just Announced Codex: A cloud engineering agent that excels in handlin...
OpenAI Just Announced Codex: A cloud engineering agent that excels in handlin...OpenAI Just Announced Codex: A cloud engineering agent that excels in handlin...
OpenAI Just Announced Codex: A cloud engineering agent that excels in handlin...
SOFTTECHHUB
 
Building a research repository that works by Clare Cady
Building a research repository that works by Clare CadyBuilding a research repository that works by Clare Cady
Building a research repository that works by Clare Cady
UXPA Boston
 
May Patch Tuesday
May Patch TuesdayMay Patch Tuesday
May Patch Tuesday
Ivanti
 
Design pattern talk by Kaya Weers - 2025 (v2)
Design pattern talk by Kaya Weers - 2025 (v2)Design pattern talk by Kaya Weers - 2025 (v2)
Design pattern talk by Kaya Weers - 2025 (v2)
Kaya Weers
 
ICDCC 2025: Securing Agentic AI - Eryk Budi Pratama.pdf
ICDCC 2025: Securing Agentic AI - Eryk Budi Pratama.pdfICDCC 2025: Securing Agentic AI - Eryk Budi Pratama.pdf
ICDCC 2025: Securing Agentic AI - Eryk Budi Pratama.pdf
Eryk Budi Pratama
 
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Christian Folini
 
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
Lorenzo Miniero
 
Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...
Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...
Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...
Maarten Verwaest
 
Digital Technologies for Culture, Arts and Heritage: Insights from Interdisci...
Digital Technologies for Culture, Arts and Heritage: Insights from Interdisci...Digital Technologies for Culture, Arts and Heritage: Insights from Interdisci...
Digital Technologies for Culture, Arts and Heritage: Insights from Interdisci...
Vasileios Komianos
 
Kit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdf
Kit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdfKit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdf
Kit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdf
Wonjun Hwang
 
論文紹介:"InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning" ...
論文紹介:"InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning" ...論文紹介:"InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning" ...
論文紹介:"InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning" ...
Toru Tamaki
 
Refactoring meta-rauc-community: Cleaner Code, Better Maintenance, More Machines
Refactoring meta-rauc-community: Cleaner Code, Better Maintenance, More MachinesRefactoring meta-rauc-community: Cleaner Code, Better Maintenance, More Machines
Refactoring meta-rauc-community: Cleaner Code, Better Maintenance, More Machines
Leon Anavi
 
AI-proof your career by Olivier Vroom and David WIlliamson
AI-proof your career by Olivier Vroom and David WIlliamsonAI-proof your career by Olivier Vroom and David WIlliamson
AI-proof your career by Olivier Vroom and David WIlliamson
UXPA Boston
 
Cybersecurity Tools and Technologies - Microsoft Certificate
Cybersecurity Tools and Technologies - Microsoft CertificateCybersecurity Tools and Technologies - Microsoft Certificate
Cybersecurity Tools and Technologies - Microsoft Certificate
VICTOR MAESTRE RAMIREZ
 
Slack like a pro: strategies for 10x engineering teams
Slack like a pro: strategies for 10x engineering teamsSlack like a pro: strategies for 10x engineering teams
Slack like a pro: strategies for 10x engineering teams
Nacho Cougil
 
Mastering Testing in the Modern F&B Landscape
Mastering Testing in the Modern F&B LandscapeMastering Testing in the Modern F&B Landscape
Mastering Testing in the Modern F&B Landscape
marketing943205
 
UiPath AgentHack - Build the AI agents of tomorrow_Enablement 1.pptx
UiPath AgentHack - Build the AI agents of tomorrow_Enablement 1.pptxUiPath AgentHack - Build the AI agents of tomorrow_Enablement 1.pptx
UiPath AgentHack - Build the AI agents of tomorrow_Enablement 1.pptx
anabulhac
 
Who's choice? Making decisions with and about Artificial Intelligence, Keele ...
Who's choice? Making decisions with and about Artificial Intelligence, Keele ...Who's choice? Making decisions with and about Artificial Intelligence, Keele ...
Who's choice? Making decisions with and about Artificial Intelligence, Keele ...
Alan Dix
 
Harmonizing Multi-Agent Intelligence | Open Data Science Conference | Gary Ar...
Harmonizing Multi-Agent Intelligence | Open Data Science Conference | Gary Ar...Harmonizing Multi-Agent Intelligence | Open Data Science Conference | Gary Ar...
Harmonizing Multi-Agent Intelligence | Open Data Science Conference | Gary Ar...
Gary Arora
 
OpenAI Just Announced Codex: A cloud engineering agent that excels in handlin...
OpenAI Just Announced Codex: A cloud engineering agent that excels in handlin...OpenAI Just Announced Codex: A cloud engineering agent that excels in handlin...
OpenAI Just Announced Codex: A cloud engineering agent that excels in handlin...
SOFTTECHHUB
 
Building a research repository that works by Clare Cady
Building a research repository that works by Clare CadyBuilding a research repository that works by Clare Cady
Building a research repository that works by Clare Cady
UXPA Boston
 
May Patch Tuesday
May Patch TuesdayMay Patch Tuesday
May Patch Tuesday
Ivanti
 
Design pattern talk by Kaya Weers - 2025 (v2)
Design pattern talk by Kaya Weers - 2025 (v2)Design pattern talk by Kaya Weers - 2025 (v2)
Design pattern talk by Kaya Weers - 2025 (v2)
Kaya Weers
 
ICDCC 2025: Securing Agentic AI - Eryk Budi Pratama.pdf
ICDCC 2025: Securing Agentic AI - Eryk Budi Pratama.pdfICDCC 2025: Securing Agentic AI - Eryk Budi Pratama.pdf
ICDCC 2025: Securing Agentic AI - Eryk Budi Pratama.pdf
Eryk Budi Pratama
 

Using spark data frame for sql

  • 1. Basic Using Spark DataFrame For SQL charsyam@naver.com
  • 2. Create DataFrame From File val path = “abc.txt” val df = spark.read.text(path)
  • 3. Create DataFrame From Kafka val rdd = KafkaUtils.createRDD[String, String](...) val logsDF = rdd.map { _.value }.toDF
  • 4. Spark DataFrame Column 1) col("column name") 2) $"column name" 1) And 2) are the same.
  • 5. Simple Iris TSV Logs http://www.math.uah.edu/stat/data/Fisher.txt Type PW PL SW SL 0 2 14 33 50 1 24 56 31 67 1 23 51 31 69 0 2 10 36 46 1 20 52 30 65 1 19 51 27 58
  • 6. Load TSV with StructType import org.apache.spark.sql.types._ var irisSchema = StructType(Array( StructField("Type", IntegerType, true), StructField("PetalWidth", IntegerType, true), StructField("PetalLength", IntegerType, true), StructField("SepalWidth", IntegerType, true), StructField("SepalLength", IntegerType, true) ))
  • 7. Load TSV with Encoder #1 import org.apache.spark.sql.Encoders case class IrisSchema(Type: Int, PetalWidth: Int, PetalLength: Int, SepalWidth: Int, SepalLength: Int) var irisSchema = Encoders.product[IrisSchema].schema
  • 8. Load TSV var irisDf = spark.read.format("csv"). // Use "csv" regardless of TSV or CSV. option("header", "true"). // Does the file have a header line? option("delimiter", "t"). // Set delimiter to tab or comma. schema(irisSchema). // Schema that was built above. load("Fisher.txt") irisDf.show(5)
  • 9. Load TSV - Show Results scala> irisDf.show(5) +----+----------+-----------+----------+-----------+ |Type|PetalWidth|PetalLength|SepalWidth|SepalLength| +----+----------+-----------+----------+-----------+ | 0| 2| 14| 33| 50| | 1| 24| 56| 31| 67| | 1| 23| 51| 31| 69| | 0| 2| 10| 36| 46| | 1| 20| 52| 30| 65| +----+----------+-----------+----------+-----------+ only showing top 5 rows
  • 10. Using sqlContext sql Super easy way val view = df.createOrReplaceTempView("tmp_iris") val resultDF = df.sqlContext.sql("select type, PetalWidth from tmp_iris")
  • 11. Simple Select SQL: Select type, petalwidth + sepalwidth as sum_width from … val sumDF = df.withColumn("sum_width", col("PetalWidth") + col("SepalWidth")) val resultDF = sumDF.selectExpr("Type", "sum_width") val resultDF = sumDF.selectExpr("*") ← select *
  • 12. Select with where SQL: Select type, petalwidth from … where petalwidth > 10 val whereDF = df.filter($"petalwidth" > 10) val whereDF = df.where($"petalwidth" > 10) //filter and where are the same val resultDF = whereDF.selectExpr("Type", "petalwidth")
  • 13. Select with order by SQL: Select petalwidth, sepalwidth from … order by petalwidth, sepalwidth desc 1) val sortDF = df.sort($"petalwidth", $"sepalwidth".desc) 2) val sortDF = df.sort($"petalwidth", desc("sepalwidth")) 3) val sortDF = df.orderBy($"petalwidth", desc("sepalwidth")) 1), 2) And 3) are the same. val resultDF = sortDF.selectExpr("petalwidth", "sepalwidth")
  • 14. Select with Group by SQL: Select type, max(petalwidth) A, min(sepalwidth) B from … group by type val groupDF = df.groupBy($"type").agg(max($"petalwidth").as("A"), min($"sepalwidth").as("B")) val resultDF = groupDF.selectExpr("type", "A", "B")
  • 15. Tip - Support MapType<String, String> like Hive SQL in Hive: Create table test (type map<string, string>); Hive support str_to_map, but spark not support for dataframe(spark support str_to_map for hiveQL). Using udf to solve this. val string_line = "A=1,B=2,C=3" Val df = logsDF.withColumn("type", str_to_map(string_line))
  • 16. UDF - str_to_map val str_to_map = udf { text : String => val pairs = text.split("delimiter1|delimiter2").grouped(2) pairs.map { case Array(k, v) => k -> v}.toMap }
  翻译: