Natural Language Query and Conversational Interface to Apache Spark

Introduction to Auto Loader
How to easily ingest PBs of data into your Delta lakehouse
Pranav Anand
Software Engineer at Databricks

Building a data pipeline
s3:/logs
Cloud Storage
Input files containing unstructured and semi-structured data
Final data

Building a data pipeline: Step one
s3:/logs
Cloud Storage
Structured Streaming
Input files containing unstructured and semi-structured data
Structured tables
Data Ingestion

Building a data pipeline: Step one
s3:/logs
Cloud Storage
Structured Streaming
Input files
Structured tables
“Here be dragons”

Challenge: Scalability
input path: s3:/logs
Cloud Storage
File Stream Source
a.json, b.json
List on trigger at t = 0
Transformed data
Final data

Cloud Storage
c.json, a.json, b.json
List on trigger at t = 5
Transformed data
Seen file paths
a.json
b.json

Cloud Storage
List on trigger at t = 5
Transformed data
Seen file paths
a.json
b.json
Repeated listing is slow and expensive

Cloud Storage
List on trigger at t = 5
Transformed data
Seen file paths map
a.json
b.json
In-memory map does not scale
Repeated listing is slow and expensive
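
A minimal sketch of the listing approach on the slides above, assuming a plain Python list stands in for cloud storage. `ListingFileSource` and `poll` are hypothetical names for illustration, not Spark's file stream source API. It shows why both costs grow with the number of files: every trigger lists everything, and every path stays in memory.

```python
from typing import List, Set


class ListingFileSource:
    """Toy file source that finds new files by listing the whole input path."""

    def __init__(self) -> None:
        self.seen: Set[str] = set()  # in-memory map of seen file paths
        self.paths_listed = 0        # total paths returned across all listings

    def poll(self, storage: List[str]) -> List[str]:
        # Every trigger lists all files, including ones already ingested.
        self.paths_listed += len(storage)
        new_files = [p for p in storage if p not in self.seen]
        self.seen.update(new_files)
        return new_files


storage = ["a.json", "b.json"]
source = ListingFileSource()
first = source.poll(storage)    # trigger at t = 0
storage.append("c.json")
second = source.poll(storage)   # trigger at t = 5
```

Only c.json is new at the second trigger, yet a.json and b.json are listed again and remain in the seen set.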

Challenge: Schema
Cloud Storage
{
id: 5
name: "John"
}
{
id: 7
name: "Amy"
}
a.json
b.json
Manually infer and set schema as
id: Int
name: String
Ready to go!
Input files


Challenge: Schema
Cloud Storage
Input files
{
id: 18,
name: "Olivia"
}
{
id: 23,
name: "Alex",
age: 31
}
e.json
f.json
Data loss!
Manually handle this column. User intervention needed every time.
How do I deal with all this?!
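
A small sketch of the data loss described above, assuming records are parsed against a schema that was fixed by hand. `FIXED_SCHEMA` and `parse_with_fixed_schema` are hypothetical names for illustration and do not reflect how Spark parses JSON internally.

```python
import json
from typing import Any, Dict

# Schema set by hand after looking at a.json and b.json.
FIXED_SCHEMA = ("id", "name")


def parse_with_fixed_schema(line: str) -> Dict[str, Any]:
    record = json.loads(line)
    # Only known columns are kept; anything else is silently dropped.
    return {column: record.get(column) for column in FIXED_SCHEMA}


e_json = '{"id": 18, "name": "Olivia"}'
f_json = '{"id": 23, "name": "Alex", "age": 31}'
rows = [parse_with_fixed_schema(e_json), parse_with_fixed_schema(f_json)]
```

The row from f.json comes back without `age`, and nothing signals that a value was lost.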

Auto Loader
▪ New Structured Streaming source
▪ Solves biggest ingestion challenges:
▪ Scalability
▪ Schema management

File Notification Mode
s3:/logs
Cloud Storage
Input files

s3:/logs
Cloud Storage
Input files
a.json
File notification generated for a.json
a.json
s3:/logs

Input files
Cloud files source
Pull a.json from queue
Delete a.json from queue once ingested
a.json
s3:/logs
Cloud Storage
a.json
File notification
s3:/logs
No listing!
Files ingested as they arrive.

Input files
Pull a.json from queue
Delete a.json from queue once ingested
a.json
s3:/logs
Cloud Storage
a.json
File notification
s3:/logs
Cloud files source
Seen file paths map in RocksDB
a.json
RocksDB deduplication means no scalability limits
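
A rough sketch of the notification flow above: pull a path from the queue, deduplicate against a persistent seen-paths store, then delete the notification. Assumptions, all for illustration only: a `deque` stands in for the cloud queue, SQLite stands in for RocksDB, `NotificationFileSource` is a hypothetical name, and the repeated a.json notification is an invented test input.

```python
import sqlite3
from collections import deque
from typing import Deque, List


class NotificationFileSource:
    """Toy source that ingests from a notification queue, never by listing."""

    def __init__(self, db_path: str = ":memory:") -> None:
        # Seen file paths live in a key-value store rather than in process memory.
        self.store = sqlite3.connect(db_path)
        self.store.execute(
            "CREATE TABLE IF NOT EXISTS seen (path TEXT PRIMARY KEY)"
        )

    def _mark_seen(self, path: str) -> bool:
        """Record the path; return True only if it had not been seen before."""
        row = self.store.execute(
            "SELECT 1 FROM seen WHERE path = ?", (path,)
        ).fetchone()
        if row is not None:
            return False
        self.store.execute("INSERT INTO seen (path) VALUES (?)", (path,))
        self.store.commit()
        return True

    def drain(self, queue: Deque[str]) -> List[str]:
        ingested = []
        while queue:
            path = queue[0]           # pull the notification from the queue
            if self._mark_seen(path):
                ingested.append(path)
            queue.popleft()           # delete from the queue once ingested
        return ingested


queue = deque(["a.json", "a.json", "b.json"])
source = NotificationFileSource()
ingested = source.drain(queue)
```

Work per trigger is proportional to the number of new notifications, not to the number of files already in storage.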

Backfill
Input files
Pull a.json from cloud queue
Delete a.json from queue once ingested
a.json
Cloud Storage
a.json
File notification
Cloud files source
Seen file paths map in RocksDB
a.json
A.json
a.json, A.json
Include existing files
A.json
Internal queue
a.json
A.json
Existing files
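
One way to picture the backfill above: existing files and new notifications feed a single internal queue, and the seen-paths check keeps each file from being ingested twice. This is a sketch under assumptions; `build_internal_queue` and `ingest` are hypothetical names, a Python set stands in for the RocksDB map, and the ordering of the internal queue here is arbitrary.

```python
from collections import deque
from typing import Deque, Iterable, List, Set


def build_internal_queue(
    existing_files: Iterable[str],
    notifications: Iterable[str],
    include_existing_files: bool = True,
) -> Deque[str]:
    internal: Deque[str] = deque()
    if include_existing_files:
        internal.extend(existing_files)  # files already in storage at start
    internal.extend(notifications)       # files announced by notifications
    return internal


def ingest(internal: Deque[str], seen: Set[str]) -> List[str]:
    ingested = []
    while internal:
        path = internal.popleft()
        if path not in seen:             # deduplicate across both inputs
            seen.add(path)
            ingested.append(path)
    return ingested


with_backfill = ingest(build_internal_queue(["A.json"], ["a.json"]), set())
without_backfill = ingest(
    build_internal_queue(["A.json"], ["a.json"], include_existing_files=False),
    set(),
)
```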

spark
  .readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.useNotifications", "true")
  .option("cloudFiles.includeExistingFiles", "true")
  .load()
▪ No repeated listing
▪ Scalable to many millions of files
▪ Done simply

Challenge: Schema
Cloud Storage
Input files
a.json
Starting off more simply
{
id: 5
name: "John"
}
{
id: 7
name: "Amy"
}
b.json
Automatically infer and set schema as
id: Int
name: String
Ready to go!
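
A toy version of automatic inference for the two records above, assuming each file holds one JSON object per line. `infer_schema` and `TYPE_NAMES` are hypothetical names for illustration; this is not Auto Loader's inference logic, and real inference handles many more types and conflicts.

```python
import json
from typing import Dict, Iterable

# Assumed mapping from Python value types to the type names on the slide.
TYPE_NAMES = {int: "Int", str: "String", float: "Double", bool: "Boolean"}


def infer_schema(lines: Iterable[str]) -> Dict[str, str]:
    schema: Dict[str, str] = {}
    for line in lines:
        for column, value in json.loads(line).items():
            # The first occurrence of a column decides its type in this sketch.
            schema.setdefault(column, TYPE_NAMES.get(type(value), "String"))
    return schema


schema = infer_schema(['{"id": 5, "name": "John"}', '{"id": 7, "name": "Amy"}'])
```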

Challenge: Schema
Cloud Storage
Input files
{
id: 18,
name: "Olivia"
}
{
id: 23,
name: "Alex",
age: 31
}
e.json
f.json
Where we hit a dead end...
Data loss!
Manually handle this column. User intervention needed every time.
How do I deal with all this?!

Challenge: Schema
Cloud Storage
Input files
{
id: 18,
name: "Olivia"
}
{
id: 23,
name: "Alex",
age: 31
}
e.json
f.json
What we can do about it
Data loss!
Manually handle this column. User intervention needed every time.
How do I deal with all this?!
+ Auto Loader

Challenge: Schema
Cloud Storage
Input files
{
id: 18,
name: "Olivia"
}
{
id: 23,
name: "Alex",
age: 31
}
e.json
f.json
Self-sustaining stream: Evolve schema
Schema was inferred as
id: Int
name: String
New schema:
id: Int
name: String
age: Int

Challenge: Schema
Cloud Storage
Input files
{
id: 18,
name: "Olivia"
}
{
id: 23,
name: "Alex",
age: 31
}
e.json
f.json
Self-sustaining stream: Evolve schema
Schema was inferred as
id: Int
name: String
New schema:
id: Int
name: String
age: Int
Set and forget!
.option("cloudFiles.schemaEvolutionMode", "addNewColumns")
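
The evolution step above amounts to merging newly observed columns into the current schema. A minimal sketch, assuming schemas are plain dictionaries of column name to type name; `evolve_schema` is a hypothetical helper, not part of the cloudFiles API.

```python
from typing import Dict


def evolve_schema(current: Dict[str, str], observed: Dict[str, str]) -> Dict[str, str]:
    evolved = dict(current)
    for column, type_name in observed.items():
        # Existing columns keep their type; unseen columns are appended.
        evolved.setdefault(column, type_name)
    return evolved


current = {"id": "Int", "name": "String"}
observed_in_f_json = {"id": "Int", "name": "String", "age": "Int"}
new_schema = evolve_schema(current, observed_in_f_json)
```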

Challenge: Schema
Cloud Storage
Input files
{
id: 18,
name: "Olivia"
}
{
id: 23,
name: "Alex",
age: 31
}
e.json
f.json
Self-sustaining stream: Rescue data
Schema was inferred as
id: Int
name: String
New schema:
id: Int
name: String
_rescued_data: String
Set and forget!
.option("cloudFiles.schemaEvolutionMode", "rescue")

Challenge: Schema
Cloud Storage
Input files
{
id: 18,
name: "Olivia"
}
{
id: 23,
name: "Alex",
age: 31
}
e.json
f.json
Self-sustaining stream: Rescue data
Schema was inferred as
id: Int
name: String
New schema:
id: Int
name: String
_rescued_data: String
Set and forget!
.option("cloudFiles.schemaEvolutionMode", "rescue")
Other modes like "failOnNewColumns" and "none"
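
A sketch of the rescue idea above: the schema stays fixed, and anything outside it is kept as a JSON string in `_rescued_data` instead of being dropped. `parse_with_rescue` and `SCHEMA` are hypothetical names, and the exact contents Auto Loader writes to the rescued column may differ from this simplified string.

```python
import json
from typing import Any, Dict

SCHEMA = ("id", "name")  # schema inferred from the earlier files


def parse_with_rescue(line: str) -> Dict[str, Any]:
    record = json.loads(line)
    row: Dict[str, Any] = {column: record.get(column) for column in SCHEMA}
    # Fields outside the schema are preserved rather than lost.
    extra = {k: v for k, v in record.items() if k not in SCHEMA}
    row["_rescued_data"] = json.dumps(extra) if extra else None
    return row


e_row = parse_with_rescue('{"id": 18, "name": "Olivia"}')
f_row = parse_with_rescue('{"id": 23, "name": "Alex", "age": 31}')
```

The `age` value from f.json stays recoverable by parsing `_rescued_data` downstream, with no change to the table schema.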

In practice
▪ 50+ TB per day of logs ingested at Databricks using Auto Loader
▪ Customers have used Auto Loader to ingest 10s of PB of data

Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.
