SlideShare a Scribd company logo
Introduction to Auto
Loader
How to easily ingest PBs of data
into your Delta lakehouse
Pranav Anand
Software Engineer at Databricks
s3:/logs
Cloud Storage
Input files containing
unstructured and
semi-structured data
Final data
Building a data pipeline
Building a data pipeline: Step one
s3:/logs
Cloud Storage
Structured
Streaming
Input files containing
unstructured and
semi-structured data
Structured tables
Data Ingestion
s3:/logs
Cloud Storage
Structured
Streaming
Input files
Structured tables
“Here be dragons”
Building a data pipeline: Step one
input path: s3:/logs
Cloud Storage
File Stream Source Final data
a.json, b.json
List on trigger at t =
0 Transformed
data
Challenge: Scalability
input path: s3:/logs
Cloud Storage
File Stream Source Final data
c.json, a.json, b.json
List on trigger at t =
5 Transformed
data
Challenge: Scalability
Seen file
paths
a.json
b.json
input path: s3:/logs
Cloud Storage
File Stream Source Final data
c.json, a.json, b.json
List on trigger at t =
5 Transformed
data
Challenge: Scalability
Seen file
paths
a.json
b.json
Repeated listing is
slow and expensive
input path: s3:/logs
Cloud Storage
File Stream Source Final data
c.json, a.json, b.json
List on trigger at t =
5 Transformed
data
Challenge: Scalability
Seen file
paths map
a.json
b.json
In-memory map
does not scale
Repeated listing is
slow and expensive
Challenge: Schema
Cloud Storage
{
id: 5
name: “John”
}
{
id: 7
name: “Amy”
}
a.json
b.json
Manually infer and
set schema as
id: Int
name: String
Ready to go!
Input files
Natural Language Query and Conversational Interface to Apache Spark
Challenge: Schema
Cloud Storage
Input files
{
id: 18,
name: “Olivia”
}
{
id: 23,
name: “Alex”,
age: 31
}
e.json
f.json
Data loss!
Manually handle this
column. User intervention
needed every time.
How do I deal with all
this?!
Auto Loader
▪ New Structured
Streaming source
▪ Solves biggest
ingestion challenges:
▪ Scalability
▪ Schema management
Challenge: Scalability
File Notification Mode
s3:/logs
Cloud Storage
Input files
Challenge: Scalability
File Notification Mode
s3:/logs
Cloud Storage
Input files
a.json
File notification
generated for a.json
a.json
s3:/logs
Challenge: Scalability
File Notification Mode
Input files
Cloud files
source
Pull a.json from
queue
Delete a.json
from queue
once ingested
a.json
s3:/logs
Cloud Storage
a.json
File notification
generated for a.json
s3:/logs
No listing!
Files ingested as they
arrive.
Challenge: Scalability
File Notification Mode
Input files
Pull a.json from
queue
Delete a.json
from queue
once ingested
a.json
s3:/logs
Cloud Storage
a.json
File notification
generated for a.json
s3:/logs
Cloud files
source
Seen file paths
map
in RocksDB
a.json
RocksDB deduplication
means no scalability limits
Challenge: Scalability
Backfill
Input files
Pull a.json from
cloud queue
Delete a.json
from queue
once ingested
a.json
Cloud Storage
a.json
File notification
generated for a.json
Cloud files
source
Seen file paths
map
in RocksDB
a.json
A.json
a.json, A.json
Include existing files
A.json
Internal
queue
a.json
A.json
Existing files
Challenge: Scalability
File Notification Mode
spark
.readStream
.format(“cloudFiles”)
.option(“cloudFiles.format”, “json”)
.option(“cloudFiles.useNotifications”, “true”)
.option(“cloudFiles.includeExistingFiles”,
“true”)
.load()
▪ No repeated listing
▪ Scalable to many millions of files
▪ Done simply
Challenge: Schema
Cloud Storage
Input files
a.json
Starting off more simply
{
id: 5
name: “John”
}
{
id: 7
name: “Amy”
}
b.json
Automatically infer
and set schema as
id: Int
name: String
Ready to go!
Challenge: Schema
Cloud Storage
Input files
{
id: 18,
name: “Olivia”
}
{
id: 23,
name: “Alex”,
age: 31
}
e.json
f.json
Where we hit a dead end...
How do I deal with all
this?!
Data loss!
Manually handle this
column. User intervention
needed every time.
Challenge: Schema
Cloud Storage
Input files
{
id: 18,
name: “Olivia”
}
{
id: 23,
name: “Alex”,
age: 31
}
e.json
f.json
What we can do about it
How do I deal with all
this?!
+ Auto Loader
Data loss!
Manually handle this
column. User intervention
needed every time.
Challenge: Schema
Cloud Storage
Input files
{
id: 18,
name: “Olivia”
}
{
id: 23,
name: “Alex”,
age: 31
}
e.json
f.json
Self-sustaining stream - Evolve schema
Schema was
inferred as
id: Int
name: String
New schema:
id: Int
name: String
age: Int
Challenge: Schema
Cloud Storage
Input files
{
id: 18,
name: “Olivia”
}
{
id: 23,
name: “Alex”,
age: 31
}
e.json
f.json
Self-sustaining stream - Evolve schema
Schema was
inferred as
id: Int
name: String
New schema:
id: Int
name: String
age: Int
Set and forget!
.option(“cloudFiles.schemaEvolutionMode”, “addNewColumns”)
Challenge: Schema
Cloud Storage
Input files
{
id: 18,
name: “Olivia”
}
{
id: 23,
name: “Alex”,
age: 31
}
e.json
f.json
Self-sustaining stream: Rescue data
Schema was
inferred as
id: Int
name: String
_rescued_data: String
New schema:
id: Int
name: String
_rescued_data: String
Set and forget!
.option(“cloudFiles.schemaEvolutionMode”, “rescue”)
Challenge: Schema
Cloud Storage
Input files
{
id: 18,
name: “Olivia”
}
{
id: 23,
name: “Alex”,
age: 31
}
e.json
f.json
Self-sustaining stream: Rescue data
Schema was
inferred as
id: Int
name: String
_rescued_data: String
New schema:
id: Int
name: String
_rescued_data: String
Set and forget!
.option(“cloudFiles.schemaEvolutionMode”, “rescue”)
Other modes like “failOnNewColumns” and “none”
Let’s look at the code
In practice
▪ 50+ TB per day of logs
ingested at Databricks
using Auto Loader
▪ Customers have used
Auto Loader to ingest
10s of PB of data
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.
Natural Language Query and Conversational Interface to Apache Spark
Ad

More Related Content

What's hot (20)

Hoodie: How (And Why) We built an analytical datastore on Spark
Hoodie: How (And Why) We built an analytical datastore on SparkHoodie: How (And Why) We built an analytical datastore on Spark
Hoodie: How (And Why) We built an analytical datastore on Spark
Vinoth Chandar
 
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
Databricks
 
MLflow Model Serving
MLflow Model ServingMLflow Model Serving
MLflow Model Serving
Databricks
 
Multi dimension aggregations using spark and dataframes
Multi dimension aggregations using spark and dataframesMulti dimension aggregations using spark and dataframes
Multi dimension aggregations using spark and dataframes
Romi Kuntsman
 
Prestogres internals
Prestogres internalsPrestogres internals
Prestogres internals
Sadayuki Furuhashi
 
Fluentd at Bay Area Kubernetes Meetup
Fluentd at Bay Area Kubernetes MeetupFluentd at Bay Area Kubernetes Meetup
Fluentd at Bay Area Kubernetes Meetup
Sadayuki Furuhashi
 
Sqoop on Spark for Data Ingestion
Sqoop on Spark for Data IngestionSqoop on Spark for Data Ingestion
Sqoop on Spark for Data Ingestion
DataWorks Summit
 
Data Analytics Service Company and Its Ruby Usage
Data Analytics Service Company and Its Ruby UsageData Analytics Service Company and Its Ruby Usage
Data Analytics Service Company and Its Ruby Usage
SATOSHI TAGOMORI
 
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
Miklos Christine
 
Getting started with Apollo Client and GraphQL
Getting started with Apollo Client and GraphQLGetting started with Apollo Client and GraphQL
Getting started with Apollo Client and GraphQL
Morgan Dedmon
 
Operational Tips for Deploying Spark by Miklos Christine
Operational Tips for Deploying Spark by Miklos ChristineOperational Tips for Deploying Spark by Miklos Christine
Operational Tips for Deploying Spark by Miklos Christine
Spark Summit
 
Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014
Sadayuki Furuhashi
 
Building Hadoop Data Applications with Kite
Building Hadoop Data Applications with KiteBuilding Hadoop Data Applications with Kite
Building Hadoop Data Applications with Kite
huguk
 
Logging for Production Systems in The Container Era
Logging for Production Systems in The Container EraLogging for Production Systems in The Container Era
Logging for Production Systems in The Container Era
Sadayuki Furuhashi
 
Flink Forward SF 2017: James Malone - Make The Cloud Work For You
Flink Forward SF 2017: James Malone - Make The Cloud Work For YouFlink Forward SF 2017: James Malone - Make The Cloud Work For You
Flink Forward SF 2017: James Malone - Make The Cloud Work For You
Flink Forward
 
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim DowlingStructured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
Databricks
 
A Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons LearnedA Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons Learned
Databricks
 
Mapping Data Flows Perf Tuning April 2021
Mapping Data Flows Perf Tuning April 2021Mapping Data Flows Perf Tuning April 2021
Mapping Data Flows Perf Tuning April 2021
Mark Kromer
 
Fast real-time approximations using Spark streaming
Fast real-time approximations using Spark streamingFast real-time approximations using Spark streaming
Fast real-time approximations using Spark streaming
huguk
 
Building Operational Data Lake using Spark and SequoiaDB with Yang Peng
Building Operational Data Lake using Spark and SequoiaDB with Yang PengBuilding Operational Data Lake using Spark and SequoiaDB with Yang Peng
Building Operational Data Lake using Spark and SequoiaDB with Yang Peng
Databricks
 
Hoodie: How (And Why) We built an analytical datastore on Spark
Hoodie: How (And Why) We built an analytical datastore on SparkHoodie: How (And Why) We built an analytical datastore on Spark
Hoodie: How (And Why) We built an analytical datastore on Spark
Vinoth Chandar
 
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
Databricks
 
MLflow Model Serving
MLflow Model ServingMLflow Model Serving
MLflow Model Serving
Databricks
 
Multi dimension aggregations using spark and dataframes
Multi dimension aggregations using spark and dataframesMulti dimension aggregations using spark and dataframes
Multi dimension aggregations using spark and dataframes
Romi Kuntsman
 
Fluentd at Bay Area Kubernetes Meetup
Fluentd at Bay Area Kubernetes MeetupFluentd at Bay Area Kubernetes Meetup
Fluentd at Bay Area Kubernetes Meetup
Sadayuki Furuhashi
 
Sqoop on Spark for Data Ingestion
Sqoop on Spark for Data IngestionSqoop on Spark for Data Ingestion
Sqoop on Spark for Data Ingestion
DataWorks Summit
 
Data Analytics Service Company and Its Ruby Usage
Data Analytics Service Company and Its Ruby UsageData Analytics Service Company and Its Ruby Usage
Data Analytics Service Company and Its Ruby Usage
SATOSHI TAGOMORI
 
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
Miklos Christine
 
Getting started with Apollo Client and GraphQL
Getting started with Apollo Client and GraphQLGetting started with Apollo Client and GraphQL
Getting started with Apollo Client and GraphQL
Morgan Dedmon
 
Operational Tips for Deploying Spark by Miklos Christine
Operational Tips for Deploying Spark by Miklos ChristineOperational Tips for Deploying Spark by Miklos Christine
Operational Tips for Deploying Spark by Miklos Christine
Spark Summit
 
Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014
Sadayuki Furuhashi
 
Building Hadoop Data Applications with Kite
Building Hadoop Data Applications with KiteBuilding Hadoop Data Applications with Kite
Building Hadoop Data Applications with Kite
huguk
 
Logging for Production Systems in The Container Era
Logging for Production Systems in The Container EraLogging for Production Systems in The Container Era
Logging for Production Systems in The Container Era
Sadayuki Furuhashi
 
Flink Forward SF 2017: James Malone - Make The Cloud Work For You
Flink Forward SF 2017: James Malone - Make The Cloud Work For YouFlink Forward SF 2017: James Malone - Make The Cloud Work For You
Flink Forward SF 2017: James Malone - Make The Cloud Work For You
Flink Forward
 
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim DowlingStructured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
Databricks
 
A Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons LearnedA Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons Learned
Databricks
 
Mapping Data Flows Perf Tuning April 2021
Mapping Data Flows Perf Tuning April 2021Mapping Data Flows Perf Tuning April 2021
Mapping Data Flows Perf Tuning April 2021
Mark Kromer
 
Fast real-time approximations using Spark streaming
Fast real-time approximations using Spark streamingFast real-time approximations using Spark streaming
Fast real-time approximations using Spark streaming
huguk
 
Building Operational Data Lake using Spark and SequoiaDB with Yang Peng
Building Operational Data Lake using Spark and SequoiaDB with Yang PengBuilding Operational Data Lake using Spark and SequoiaDB with Yang Peng
Building Operational Data Lake using Spark and SequoiaDB with Yang Peng
Databricks
 

Similar to Natural Language Query and Conversational Interface to Apache Spark (20)

Virtual Flink Forward 2020: Netflix Data Mesh: Composable Data Processing - J...
Virtual Flink Forward 2020: Netflix Data Mesh: Composable Data Processing - J...Virtual Flink Forward 2020: Netflix Data Mesh: Composable Data Processing - J...
Virtual Flink Forward 2020: Netflix Data Mesh: Composable Data Processing - J...
Flink Forward
 
Graph Analysis over JSON, Larus
Graph Analysis over JSON, LarusGraph Analysis over JSON, Larus
Graph Analysis over JSON, Larus
Neo4j
 
CSV JSON and XML files in Python.pptx
CSV JSON and XML files in Python.pptxCSV JSON and XML files in Python.pptx
CSV JSON and XML files in Python.pptx
Ramakrishna Reddy Bijjam
 
Apache Spark - Loading & Saving data | Big Data Hadoop Spark Tutorial | Cloud...
Apache Spark - Loading & Saving data | Big Data Hadoop Spark Tutorial | Cloud...Apache Spark - Loading & Saving data | Big Data Hadoop Spark Tutorial | Cloud...
Apache Spark - Loading & Saving data | Big Data Hadoop Spark Tutorial | Cloud...
CloudxLab
 
Einführung in MongoDB
Einführung in MongoDBEinführung in MongoDB
Einführung in MongoDB
NETUserGroupBern
 
Data file handling in python binary & csv files
Data file handling in python binary & csv filesData file handling in python binary & csv files
Data file handling in python binary & csv files
keeeerty
 
Data file handling in python binary & csv files
Data file handling in python binary & csv filesData file handling in python binary & csv files
Data file handling in python binary & csv files
Keerty Smile
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in Spark
Databricks
 
Play Framework: async I/O with Java and Scala
Play Framework: async I/O with Java and ScalaPlay Framework: async I/O with Java and Scala
Play Framework: async I/O with Java and Scala
Yevgeniy Brikman
 
Crafting Evolvable Api Responses
Crafting Evolvable Api ResponsesCrafting Evolvable Api Responses
Crafting Evolvable Api Responses
darrelmiller71
 
CS8651 IP Unit 2 pdf regulation -2017 anna university
CS8651  IP Unit 2 pdf regulation -2017 anna universityCS8651  IP Unit 2 pdf regulation -2017 anna university
CS8651 IP Unit 2 pdf regulation -2017 anna university
amrashbhanuabdul
 
Learning spark ch05 - Loading and Saving Your Data
Learning spark ch05 - Loading and Saving Your DataLearning spark ch05 - Loading and Saving Your Data
Learning spark ch05 - Loading and Saving Your Data
phanleson
 
Managing Social Content with MongoDB
Managing Social Content with MongoDBManaging Social Content with MongoDB
Managing Social Content with MongoDB
MongoDB
 
Working With a Real-World Dataset in Neo4j: Import and Modeling
Working With a Real-World Dataset in Neo4j: Import and ModelingWorking With a Real-World Dataset in Neo4j: Import and Modeling
Working With a Real-World Dataset in Neo4j: Import and Modeling
Neo4j
 
08 ajax
08 ajax08 ajax
08 ajax
Ynon Perek
 
Amazon S3 Masterclass
Amazon S3 MasterclassAmazon S3 Masterclass
Amazon S3 Masterclass
Ian Massingham
 
Stream or not to Stream?

Stream or not to Stream?
Stream or not to Stream?

Stream or not to Stream?

Lukasz Byczynski
 
Con fess 2013-sse-websockets-json-bhakti
Con fess 2013-sse-websockets-json-bhaktiCon fess 2013-sse-websockets-json-bhakti
Con fess 2013-sse-websockets-json-bhakti
Bhakti Mehta
 
Spray Json and MongoDB Queries: Insights and Simple Tricks.
Spray Json and MongoDB Queries: Insights and Simple Tricks.Spray Json and MongoDB Queries: Insights and Simple Tricks.
Spray Json and MongoDB Queries: Insights and Simple Tricks.
Andrii Lashchenko
 
PyconJP: Building a data preparation pipeline with Pandas and AWS Lambda
PyconJP: Building a data preparation pipeline with Pandas and AWS LambdaPyconJP: Building a data preparation pipeline with Pandas and AWS Lambda
PyconJP: Building a data preparation pipeline with Pandas and AWS Lambda
Fabian Dubois
 
Virtual Flink Forward 2020: Netflix Data Mesh: Composable Data Processing - J...
Virtual Flink Forward 2020: Netflix Data Mesh: Composable Data Processing - J...Virtual Flink Forward 2020: Netflix Data Mesh: Composable Data Processing - J...
Virtual Flink Forward 2020: Netflix Data Mesh: Composable Data Processing - J...
Flink Forward
 
Graph Analysis over JSON, Larus
Graph Analysis over JSON, LarusGraph Analysis over JSON, Larus
Graph Analysis over JSON, Larus
Neo4j
 
Apache Spark - Loading & Saving data | Big Data Hadoop Spark Tutorial | Cloud...
Apache Spark - Loading & Saving data | Big Data Hadoop Spark Tutorial | Cloud...Apache Spark - Loading & Saving data | Big Data Hadoop Spark Tutorial | Cloud...
Apache Spark - Loading & Saving data | Big Data Hadoop Spark Tutorial | Cloud...
CloudxLab
 
Data file handling in python binary & csv files
Data file handling in python binary & csv filesData file handling in python binary & csv files
Data file handling in python binary & csv files
keeeerty
 
Data file handling in python binary & csv files
Data file handling in python binary & csv filesData file handling in python binary & csv files
Data file handling in python binary & csv files
Keerty Smile
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in Spark
Databricks
 
Play Framework: async I/O with Java and Scala
Play Framework: async I/O with Java and ScalaPlay Framework: async I/O with Java and Scala
Play Framework: async I/O with Java and Scala
Yevgeniy Brikman
 
Crafting Evolvable Api Responses
Crafting Evolvable Api ResponsesCrafting Evolvable Api Responses
Crafting Evolvable Api Responses
darrelmiller71
 
CS8651 IP Unit 2 pdf regulation -2017 anna university
CS8651  IP Unit 2 pdf regulation -2017 anna universityCS8651  IP Unit 2 pdf regulation -2017 anna university
CS8651 IP Unit 2 pdf regulation -2017 anna university
amrashbhanuabdul
 
Learning spark ch05 - Loading and Saving Your Data
Learning spark ch05 - Loading and Saving Your DataLearning spark ch05 - Loading and Saving Your Data
Learning spark ch05 - Loading and Saving Your Data
phanleson
 
Managing Social Content with MongoDB
Managing Social Content with MongoDBManaging Social Content with MongoDB
Managing Social Content with MongoDB
MongoDB
 
Working With a Real-World Dataset in Neo4j: Import and Modeling
Working With a Real-World Dataset in Neo4j: Import and ModelingWorking With a Real-World Dataset in Neo4j: Import and Modeling
Working With a Real-World Dataset in Neo4j: Import and Modeling
Neo4j
 
Stream or not to Stream?

Stream or not to Stream?
Stream or not to Stream?

Stream or not to Stream?

Lukasz Byczynski
 
Con fess 2013-sse-websockets-json-bhakti
Con fess 2013-sse-websockets-json-bhaktiCon fess 2013-sse-websockets-json-bhakti
Con fess 2013-sse-websockets-json-bhakti
Bhakti Mehta
 
Spray Json and MongoDB Queries: Insights and Simple Tricks.
Spray Json and MongoDB Queries: Insights and Simple Tricks.Spray Json and MongoDB Queries: Insights and Simple Tricks.
Spray Json and MongoDB Queries: Insights and Simple Tricks.
Andrii Lashchenko
 
PyconJP: Building a data preparation pipeline with Pandas and AWS Lambda
PyconJP: Building a data preparation pipeline with Pandas and AWS LambdaPyconJP: Building a data preparation pipeline with Pandas and AWS Lambda
PyconJP: Building a data preparation pipeline with Pandas and AWS Lambda
Fabian Dubois
 
Ad

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
Ad

Recently uploaded (20)

2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
bastakwyry
 
Publication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdf
Publication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdfPublication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdf
Publication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdf
StatsCommunications
 
Process Mining at Deutsche Bank - Journey
Process Mining at Deutsche Bank - JourneyProcess Mining at Deutsche Bank - Journey
Process Mining at Deutsche Bank - Journey
Process mining Evangelist
 
AWS RDS Presentation to make concepts easy.pptx
AWS RDS Presentation to make concepts easy.pptxAWS RDS Presentation to make concepts easy.pptx
AWS RDS Presentation to make concepts easy.pptx
bharatkumarbhojwani
 
Z14_IBM__APL_by_Christian_Demmer_IBM.pdf
Z14_IBM__APL_by_Christian_Demmer_IBM.pdfZ14_IBM__APL_by_Christian_Demmer_IBM.pdf
Z14_IBM__APL_by_Christian_Demmer_IBM.pdf
Fariborz Seyedloo
 
What is ETL? Difference between ETL and ELT?.pdf
What is ETL? Difference between ETL and ELT?.pdfWhat is ETL? Difference between ETL and ELT?.pdf
What is ETL? Difference between ETL and ELT?.pdf
SaikatBasu37
 
Fundamentals of Data Analysis, its types, tools, algorithms
Fundamentals of Data Analysis, its types, tools, algorithmsFundamentals of Data Analysis, its types, tools, algorithms
Fundamentals of Data Analysis, its types, tools, algorithms
priyaiyerkbcsc
 
Lesson 6-Interviewing in SHRM_updated.pdf
Lesson 6-Interviewing in SHRM_updated.pdfLesson 6-Interviewing in SHRM_updated.pdf
Lesson 6-Interviewing in SHRM_updated.pdf
hemelali11
 
Sets theories and applications that can used to imporve knowledge
Sets theories and applications that can used to imporve knowledgeSets theories and applications that can used to imporve knowledge
Sets theories and applications that can used to imporve knowledge
saumyasl2020
 
Dynamics 365 Business Rules Dynamics Dynamics
Dynamics 365 Business Rules Dynamics DynamicsDynamics 365 Business Rules Dynamics Dynamics
Dynamics 365 Business Rules Dynamics Dynamics
heyoubro69
 
Day 1 MS Excel Basics #.pptxDay 1 MS Excel Basics #.pptxDay 1 MS Excel Basics...
Day 1 MS Excel Basics #.pptxDay 1 MS Excel Basics #.pptxDay 1 MS Excel Basics...Day 1 MS Excel Basics #.pptxDay 1 MS Excel Basics #.pptxDay 1 MS Excel Basics...
Day 1 MS Excel Basics #.pptxDay 1 MS Excel Basics #.pptxDay 1 MS Excel Basics...
Jayantilal Bhanushali
 
Mining a Global Trade Process with Data Science - Microsoft
Mining a Global Trade Process with Data Science - MicrosoftMining a Global Trade Process with Data Science - Microsoft
Mining a Global Trade Process with Data Science - Microsoft
Process mining Evangelist
 
Oral Malodor.pptx jsjshdhushehsidjjeiejdhfj
Oral Malodor.pptx jsjshdhushehsidjjeiejdhfjOral Malodor.pptx jsjshdhushehsidjjeiejdhfj
Oral Malodor.pptx jsjshdhushehsidjjeiejdhfj
maitripatel5301
 
lecture_13 tree in mmmmmmmm mmmmmfftro.pptx
lecture_13 tree in mmmmmmmm     mmmmmfftro.pptxlecture_13 tree in mmmmmmmm     mmmmmfftro.pptx
lecture_13 tree in mmmmmmmm mmmmmfftro.pptx
sarajafffri058
 
Lagos School of Programming Final Project Updated.pdf
Lagos School of Programming Final Project Updated.pdfLagos School of Programming Final Project Updated.pdf
Lagos School of Programming Final Project Updated.pdf
benuju2016
 
Feature Engineering for Electronic Health Record Systems
Feature Engineering for Electronic Health Record SystemsFeature Engineering for Electronic Health Record Systems
Feature Engineering for Electronic Health Record Systems
Process mining Evangelist
 
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
Taqyea
 
Process Mining as Enabler for Digital Transformations
Process Mining as Enabler for Digital TransformationsProcess Mining as Enabler for Digital Transformations
Process Mining as Enabler for Digital Transformations
Process mining Evangelist
 
RAG Chatbot using AWS Bedrock and Streamlit Framework
RAG Chatbot using AWS Bedrock and Streamlit FrameworkRAG Chatbot using AWS Bedrock and Streamlit Framework
RAG Chatbot using AWS Bedrock and Streamlit Framework
apanneer
 
Introduction to systems thinking tools_Eng.pdf
Introduction to systems thinking tools_Eng.pdfIntroduction to systems thinking tools_Eng.pdf
Introduction to systems thinking tools_Eng.pdf
AbdurahmanAbd
 
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
bastakwyry
 
Publication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdf
Publication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdfPublication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdf
Publication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdf
StatsCommunications
 
AWS RDS Presentation to make concepts easy.pptx
AWS RDS Presentation to make concepts easy.pptxAWS RDS Presentation to make concepts easy.pptx
AWS RDS Presentation to make concepts easy.pptx
bharatkumarbhojwani
 
Z14_IBM__APL_by_Christian_Demmer_IBM.pdf
Z14_IBM__APL_by_Christian_Demmer_IBM.pdfZ14_IBM__APL_by_Christian_Demmer_IBM.pdf
Z14_IBM__APL_by_Christian_Demmer_IBM.pdf
Fariborz Seyedloo
 
What is ETL? Difference between ETL and ELT?.pdf
What is ETL? Difference between ETL and ELT?.pdfWhat is ETL? Difference between ETL and ELT?.pdf
What is ETL? Difference between ETL and ELT?.pdf
SaikatBasu37
 
Fundamentals of Data Analysis, its types, tools, algorithms
Fundamentals of Data Analysis, its types, tools, algorithmsFundamentals of Data Analysis, its types, tools, algorithms
Fundamentals of Data Analysis, its types, tools, algorithms
priyaiyerkbcsc
 
Lesson 6-Interviewing in SHRM_updated.pdf
Lesson 6-Interviewing in SHRM_updated.pdfLesson 6-Interviewing in SHRM_updated.pdf
Lesson 6-Interviewing in SHRM_updated.pdf
hemelali11
 
Sets theories and applications that can used to imporve knowledge
Sets theories and applications that can used to imporve knowledgeSets theories and applications that can used to imporve knowledge
Sets theories and applications that can used to imporve knowledge
saumyasl2020
 
Dynamics 365 Business Rules Dynamics Dynamics
Dynamics 365 Business Rules Dynamics DynamicsDynamics 365 Business Rules Dynamics Dynamics
Dynamics 365 Business Rules Dynamics Dynamics
heyoubro69
 
Day 1 MS Excel Basics #.pptxDay 1 MS Excel Basics #.pptxDay 1 MS Excel Basics...
Day 1 MS Excel Basics #.pptxDay 1 MS Excel Basics #.pptxDay 1 MS Excel Basics...Day 1 MS Excel Basics #.pptxDay 1 MS Excel Basics #.pptxDay 1 MS Excel Basics...
Day 1 MS Excel Basics #.pptxDay 1 MS Excel Basics #.pptxDay 1 MS Excel Basics...
Jayantilal Bhanushali
 
Mining a Global Trade Process with Data Science - Microsoft
Mining a Global Trade Process with Data Science - MicrosoftMining a Global Trade Process with Data Science - Microsoft
Mining a Global Trade Process with Data Science - Microsoft
Process mining Evangelist
 
Oral Malodor.pptx jsjshdhushehsidjjeiejdhfj
Oral Malodor.pptx jsjshdhushehsidjjeiejdhfjOral Malodor.pptx jsjshdhushehsidjjeiejdhfj
Oral Malodor.pptx jsjshdhushehsidjjeiejdhfj
maitripatel5301
 
lecture_13 tree in mmmmmmmm mmmmmfftro.pptx
lecture_13 tree in mmmmmmmm     mmmmmfftro.pptxlecture_13 tree in mmmmmmmm     mmmmmfftro.pptx
lecture_13 tree in mmmmmmmm mmmmmfftro.pptx
sarajafffri058
 
Lagos School of Programming Final Project Updated.pdf
Lagos School of Programming Final Project Updated.pdfLagos School of Programming Final Project Updated.pdf
Lagos School of Programming Final Project Updated.pdf
benuju2016
 
Feature Engineering for Electronic Health Record Systems
Feature Engineering for Electronic Health Record SystemsFeature Engineering for Electronic Health Record Systems
Feature Engineering for Electronic Health Record Systems
Process mining Evangelist
 
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
Taqyea
 
Process Mining as Enabler for Digital Transformations
Process Mining as Enabler for Digital TransformationsProcess Mining as Enabler for Digital Transformations
Process Mining as Enabler for Digital Transformations
Process mining Evangelist
 
RAG Chatbot using AWS Bedrock and Streamlit Framework
RAG Chatbot using AWS Bedrock and Streamlit FrameworkRAG Chatbot using AWS Bedrock and Streamlit Framework
RAG Chatbot using AWS Bedrock and Streamlit Framework
apanneer
 
Introduction to systems thinking tools_Eng.pdf
Introduction to systems thinking tools_Eng.pdfIntroduction to systems thinking tools_Eng.pdf
Introduction to systems thinking tools_Eng.pdf
AbdurahmanAbd
 

Natural Language Query and Conversational Interface to Apache Spark

  • 1. Introduction to Auto Loader How to easily ingest PBs of data into your Delta lakehouse Pranav Anand Software Engineer at Databricks
  • 2. s3:/logs Cloud Storage Input files containing unstructured and semi-structured data Final data Building a data pipeline
  • 3. Building a data pipeline: Step one s3:/logs Cloud Storage Structured Streaming Input files containing unstructured and semi-structured data Structured tables Data Ingestion
  • 4. s3:/logs Cloud Storage Structured Streaming Input files Structured tables “Here be dragons” Building a data pipeline: Step one
  • 5. input path: s3:/logs Cloud Storage File Stream Source Final data a.json, b.json List on trigger at t = 0 Transformed data Challenge: Scalability
  • 6. input path: s3:/logs Cloud Storage File Stream Source Final data c.json, a.json, b.json List on trigger at t = 5 Transformed data Challenge: Scalability Seen file paths a.json b.json
  • 7. input path: s3:/logs Cloud Storage File Stream Source Final data c.json, a.json, b.json List on trigger at t = 5 Transformed data Challenge: Scalability Seen file paths a.json b.json Repeated listing is slow and expensive
  • 8. input path: s3:/logs Cloud Storage File Stream Source Final data c.json, a.json, b.json List on trigger at t = 5 Transformed data Challenge: Scalability Seen file paths map a.json b.json In-memory map does not scale Repeated listing is slow and expensive
  • 9. Challenge: Schema Cloud Storage { id: 5 name: “John” } { id: 7 name: “Amy” } a.json b.json Manually infer and set schema as id: Int name: String Ready to go! Input files
  • 11. Challenge: Schema Cloud Storage Input files { id: 18, name: “Olivia” } { id: 23, name: “Alex”, age: 31 } e.json f.json Data loss! Manually handle this column. User intervention needed every time. How do I deal with all this?!
  • 12. Auto Loader ▪ New Structured Streaming source ▪ Solves biggest ingestion challenges: ▪ Scalability ▪ Schema management
  • 13. Challenge: Scalability File Notification Mode s3:/logs Cloud Storage Input files
  • 14. Challenge: Scalability File Notification Mode s3:/logs Cloud Storage Input files a.json File notification generated for a.json a.json s3:/logs
  • 15. Challenge: Scalability File Notification Mode Input files Cloud files source Pull a.json from queue Delete a.json from queue once ingested a.json s3:/logs Cloud Storage a.json File notification generated for a.json s3:/logs No listing! Files ingested as they arrive.
  • 16. Challenge: Scalability File Notification Mode Input files Pull a.json from queue Delete a.json from queue once ingested a.json s3:/logs Cloud Storage a.json File notification generated for a.json s3:/logs Cloud files source Seen file paths map in RocksDB a.json RocksDB deduplication means no scalability limits
  • 17. Challenge: Scalability Backfill Input files Pull a.json from cloud queue Delete a.json from queue once ingested a.json Cloud Storage a.json File notification generated for a.json Cloud files source Seen file paths map in RocksDB a.json A.json a.json, A.json Include existing files A.json Internal queue a.json A.json Existing files
  • 18. Challenge: Scalability File Notification Mode spark .readStream .format(“cloudFiles”) .option(“cloudFiles.format”, “json”) .option(“cloudFiles.useNotifications”, “true”) .option(“cloudFiles.includeExistingFiles”, “true”) .load() ▪ No repeated listing ▪ Scalable to many millions of files ▪ Done simply
  • 19. Challenge: Schema Cloud Storage Input files a.json Starting off more simply { id: 5 name: “John” } { id: 7 name: “Amy” } b.json Automatically infer and set schema as id: Int name: String Ready to go!
  • 20. Challenge: Schema Cloud Storage Input files { id: 18, name: “Olivia” } { id: 23, name: “Alex”, age: 31 } e.json f.json Where we hit a dead end... How do I deal with all this?! Data loss! Manually handle this column. User intervention needed every time.
  • 21. Challenge: Schema Cloud Storage Input files { id: 18, name: “Olivia” } { id: 23, name: “Alex”, age: 31 } e.json f.json What we can do about it How do I deal with all this?! + Auto Loader Data loss! Manually handle this column. User intervention needed every time.
  • 22. Challenge: Schema Cloud Storage Input files { id: 18, name: “Olivia” } { id: 23, name: “Alex”, age: 31 } e.json f.json Self-sustaining stream - Evolve schema Schema was inferred as id: Int name: String New schema: id: Int name: String age: Int
  • 23. Challenge: Schema Cloud Storage Input files { id: 18, name: “Olivia” } { id: 23, name: “Alex”, age: 31 } e.json f.json Self-sustaining stream - Evolve schema Schema was inferred as id: Int name: String New schema: id: Int name: String age: Int Set and forget! .option(“cloudFiles.schemaEvolutionMode”, “addNewColumns”)
  • 24. Challenge: Schema Cloud Storage Input files { id: 18, name: “Olivia” } { id: 23, name: “Alex”, age: 31 } e.json f.json Self-sustaining stream: Rescue data Schema was inferred as id: Int name: String _rescued_data: String New schema: id: Int name: String _rescued_data: String Set and forget! .option(“cloudFiles.schemaEvolutionMode”, “rescue”)
  • 25. Challenge: Schema Cloud Storage Input files { id: 18, name: “Olivia” } { id: 23, name: “Alex”, age: 31 } e.json f.json Self-sustaining stream: Rescue data Schema was inferred as id: Int name: String _rescued_data: String New schema: id: Int name: String _rescued_data: String Set and forget! .option(“cloudFiles.schemaEvolutionMode”, “rescue”) Other modes like “failOnNewColumns” and “none”
  • 26. Let’s look at the code
  • 27. In practice ▪ 50+ TB per day of logs ingested at Databricks using Auto Loader ▪ Customers have used Auto Loader to ingest 10s of PB of data
  • 28. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.
  翻译: