Pulkit Bhanot, Amit Nene
Risk Platform
Large-scale Feature Aggregation Using Apache Spark
#Dev1SAIS
• Motivation
• Challenges
• Architecture Deep Dive
• Role of Spark
• Takeaways
Agenda
Build a scalable, self-service Feature Engineering Platform for
predictive decisioning (based on ML and Business Rules)
• Feature Engineering: use of domain knowledge to create Features
• Self-Service for Data Scientists and Analysts without reliance on Engineers
• Generic Platform: consolidating work towards wider ML efforts at Uber
Team Mission
Detect and prevent bad actors in real-time
• Domains (diagram labels): Promotions, Payments
• number of trips over X hours/weeks
• trips cancelled over Y months
• count of referrals over lifetime
• ...
Sample use case: Fraud detection
Needs
• Lifetime of entity
• Sliding long-window: days/weeks/months
• Sliding short-window: mins/hours
• Real-time
Existing solutions
• Indexed databases, streaming aggregations, warehouse
• None satisfies all of the above ("None fits the bill!")
• Complex onboarding
Time Series Aggregations
• Scale: 1000s of aggregations for 100s of millions of
business entities
• Long-window aggregation queries are slow even with
indexes (seconds); millisecond latency at high QPS is needed
• Onboarding complexity: many moving parts
• Recovery from accumulated errors
Technical Challenges
One-stop shop for aggregation
• Single system to interact with
• Single spec: automate configurations of underlying system
Scalable
• Leverage the scale of Batch system for long window
• Combine with real-time aggregation for freshness
• Rollups: aggregate over time intervals
• Fast query over rolled up aggregates
• Caveat: applies only to summable functions (e.g. sum, count)
Self-healing
• Batch system auto-corrects any errors
Our approach
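The rollup idea and its "summable functions" caveat can be illustrated with a minimal sketch. The class and method names below (RollupSketch, Partial, rollUp) are hypothetical and not taken from the platform; the sketch only assumes that each time grain stores a count and a sum, so that coarser grains are obtained by plain addition.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: rolling up per-grain partial aggregates by addition.
final class RollupSketch {
    /** A partial aggregate over one time grain; both fields are summable. */
    static final class Partial {
        final long count;
        final double sum;

        Partial(long count, double sum) {
            this.count = count;
            this.sum = sum;
        }

        /** Merging two partials is plain addition, which keeps rollups cheap. */
        Partial merge(Partial other) {
            return new Partial(count + other.count, sum + other.sum);
        }

        /** A mean is not summable itself, but can be derived from summable parts. */
        double mean() {
            return count == 0 ? 0.0 : sum / count;
        }
    }

    /** Rolls a list of fine-grained partials up into one coarser partial. */
    static Partial rollUp(List<Partial> grains) {
        Partial total = new Partial(0, 0.0);
        for (Partial p : grains) {
            total = total.merge(p);
        }
        return total;
    }

    public static void main(String[] args) {
        Partial day = rollUp(Arrays.asList(new Partial(2, 10.0), new Partial(3, 5.0)));
        System.out.println(day.count + " events, mean " + day.mean());
    }
}
```

Functions that cannot be expressed through such additive parts (a distinct count, for example) would not fit this scheme as is.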
Raw Events (Streaming + Hive) ➔ Aggregator ➔ Aggregated Features
Input parameters to black box
➔ Source events
➔ Grain size: 5 min (realtime), 1 day (offline), etc.
➔ Aggregation functions: sum, count, etc.
➔ Aggregation windows: LTD, 7 days, 30 days, etc.
Aggregator as a black box
Diagram components: Specs, Batch Aggregator (Spark Apps), Real-time Aggregator (Streaming), Feature Store, Feature Access (Microservice)
• Batch (Spark)
– Long-window: weeks, months
– Bootstrap, incremental modes
• Streaming (e.g. Kafka events)
– Short-window (<24 hrs)
– Near-real time
• Real-time Access
– Merge offline and streaming
• Feature Store
– Save rolled-up aggregates in Hive and Cassandra
Overall architecture
Diagram labels (steps numbered 1 to 10): Computation, Hive (Tbl1:<2018-04-10>, Tbl1:<2018-04-09>), Specs, Batch Aggregator (Spark Apps), Feature Store (Cassandra), Feature Extractor, Scheduler, Rollup Generator, Bootstrap, Periodic, Snapshot Manager, Optimizer, Feature Access, Optimized Snapshot, Full Snapshot, Incremental Snapshot, Dispersal, Decisioning System
Batch Spark Engine
Hive
Daily Partitioned Source Tables
• Table-1 (Tbl1:<2018-04-10> ……. Tbl1:<2018-04-13>)
– partitions <2018-04-10>, <2018-04-11>, <2018-04-12>, <2018-04-13>: col1, col2, col3, col4
Rolled-up Tables
• ATable-1_daily (Daily Incremental Rollup): features involving sliding window computation
– partitions <2018-04-10>, <2018-04-11>, <2018-04-12>, <2018-04-13>: uuid, f1, f2
• ATable-1_Lifetime (Daily Lifetime Snapshot): features involving Lifetime computation
– partitions <2018-04-10>, <2018-04-11>, <2018-04-12>, <2018-04-13>: uuid, f1_ltd, f2_ltd
• ATable-1_joined: dispersed to real-time store
– partitions <2018-04-10>, <2018-04-11>, <2018-04-12>, <2018-04-13>: uuid, f1, f2, f1_ltd, f2_ltd
Batch Storage
● Orchestrator of ETL pipelines
○ Scheduling of subtasks
○ Record incremental progress
● Optimally resize HDFS files: scale with
size of data set.
● Rich set of APIs to enable complex
optimizations
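e.g. an optimization in bootstrap dispersal:
    // Full outer join of the daily rollups with the lifetime-to-date (LTD) data on the entity key.
    // Dataset.join expects a Scala Seq of column names, hence the JavaConverters conversion.
    dailyDataset.join(
        ltdData,
        JavaConverters.asScalaIteratorConverter(
                Arrays.asList(pipelineConfig.getEntityKey()).iterator())
            .asScala()
            .toSeq(),
        "outer");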
uuid      _ltd  daily_buckets
44b7dc88  1534  [{"2017-10-24":"4"},{"2017-08-22":"3"},{"2017-09-21":"4"},{"2017-08-08":"3"},{"2017-10-03":"3"},{"2017-10-19":"5"},{"2017-09-06":"1"},{"2017-08-17":"5"},{"2017-09-09":"12"},{"2017-10-05":"5"},{"2017-09-25":"4"},{"2017-09-17":"13"}]
Role of Spark
• Ability to disperse billions of records
– HashPartitioner to the rescue
    // Partition the data by hash of the key
    HashPartitioner hashPartitioner = new HashPartitioner(partitionNumber);
    JavaPairRDD<String, Row> hashedRDD = keyedRDD.partitionBy(hashPartitioner);
    // Fetch each hash partition and process it
    for (int partitionId = 0; partitionId < partitionNumber; partitionId++) {
        // filterRows is the pipeline's own helper (not shown on the slide)
        JavaRDD<Tuple2<String, Row>> filteredHashRDD = filterRows(hashedRDD, index, partitionId);
        // Raise an error if the partition does not match
        Dataset<Row> filteredDataSet =
            etlContext.getSparkSession().createDataset(
                filteredHashRDD.map(tuple -> tuple._2()).rdd(),
                data.org$apache$spark$sql$Dataset$$encoder);
        // Repartition filteredDataSet; update the checkpoint and the records-processed
        // count after successful completion
    }
Diagram: Parent RDD ➔ P1, P2, P3, ….., Pn ➔ process each partition
Role of Spark (continued)
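The partition-by-partition dispersal with a checkpoint can be sketched without a cluster. This is a plain-Java illustration, not the pipeline's code: HashDispersalSketch, partition and disperseFrom are hypothetical names, and the in-memory lists stand in for RDD partitions. The only Spark behaviour assumed is that HashPartitioner assigns a key to the non-negative remainder of its hashCode modulo the partition count.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: hash partitioning plus resumable, per-partition processing.
final class HashDispersalSketch {
    /** Non-negative hashCode modulo the partition count, as HashPartitioner does. */
    static int partitionOf(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    /** Splits the keys into numPartitions buckets by hash. */
    static List<List<String>> partition(List<String> keys, int numPartitions) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String key : keys) {
            buckets.get(partitionOf(key, numPartitions)).add(key);
        }
        return buckets;
    }

    /** Processes buckets from the checkpoint onward; returns records handled in this run. */
    static int disperseFrom(int checkpoint, List<List<String>> buckets) {
        int processed = 0;
        for (int p = checkpoint; p < buckets.size(); p++) {
            for (String key : buckets.get(p)) {
                if (partitionOf(key, buckets.size()) != p) {
                    throw new IllegalStateException("partition mismatch for key " + key);
                }
                processed++; // a real job would write the row to the feature store here
            }
            // a real job would persist p + 1 as the new checkpoint here
        }
        return processed;
    }

    public static void main(String[] args) {
        List<List<String>> buckets = partition(Arrays.asList("a", "b", "c", "d", "e"), 3);
        System.out.println("processed " + disperseFrom(0, buckets) + " records");
    }
}
```

Restarting with the last persisted checkpoint skips the partitions that were already dispersed, which is the point of processing one hash partition at a time.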
• Global Throttling
– Feature Store can be the bottleneck
– coalesce() to limit the number of partitions, and hence the executors writing concurrently
• Inspect data
– Disperse only if any column has changed
• Monitoring and alerting
– Create custom metrics
Diagram (bootstrap): full computation snapshots (2018-02-01, 2018-02-02, 2018-02-03, …., 2018-03-01) ➔ optimized snapshots (same dates) ➔ dispersal to C*
Role of Spark in Dispersal
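The "disperse only if any column has changed" check can be sketched as a comparison of two snapshots keyed by entity. The names here (ChangedRowsSketch, changedRows) and the representation of a row as a list of column values are assumptions made for illustration; the actual job would compare Spark rows.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: keep only rows that are new or changed since the last snapshot.
final class ChangedRowsSketch {
    /** Returns rows of 'today' that are new or differ in any column from 'yesterday'. */
    static Map<String, List<Long>> changedRows(Map<String, List<Long>> yesterday,
                                               Map<String, List<Long>> today) {
        Map<String, List<Long>> changed = new LinkedHashMap<>();
        for (Map.Entry<String, List<Long>> row : today.entrySet()) {
            List<Long> previous = yesterday.get(row.getKey());
            // equals(null) is false, so rows without a previous version are kept too
            if (!row.getValue().equals(previous)) {
                changed.put(row.getKey(), row.getValue());
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        Map<String, List<Long>> yesterday = new HashMap<>();
        yesterday.put("u1", Arrays.asList(1L, 2L));
        Map<String, List<Long>> today = new HashMap<>();
        today.put("u1", Arrays.asList(1L, 2L));
        today.put("u2", Arrays.asList(3L, 4L));
        System.out.println("rows to disperse: " + changedRows(yesterday, today).keySet());
    }
}
```

Skipping unchanged rows reduces the write load on the feature store, which the slide names as the likely bottleneck.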
● Real-time, summable aggregations for < 24 hours
● Semantically equivalent to offline computation
● Aggregation rollups (5 mins) maintained in feature store (Cassandra)
Diagram (Uber Athena streaming): raw kafka events ➔ event enrichment (RPCs to microservices) ➔ streaming computation pipelines (xform_0, xform_1, xform_2) ➔ time window aggregator ➔ C*
Real-time streaming engine
● Uses time series and clustering key support in Cassandra
○ 1 table for Lifetime & LTD values.
○ Multiple tables for realtime values with grain size 5 min
● Consult metadata and assemble into single result at feature access time
Tables:
• entity_common_aggr_bt_ltd: UUID (PK), trip_count_ltd
• entity_common_aggr_bt: UUID (PK), eventbucket (CK), trip_count
• entity_common_aggr_rt_2018_05_08, entity_common_aggr_rt_2018_05_09, entity_common_aggr_rt_2018_05_10: UUID (PK), eventbucket (CK), trip_count
Feature access Service: Metadata Service, Query Planner
e.g. - lifetime trip count
- trips over last 51 hrs
- trips over previous 2 days
Final aggregation in real time
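A query such as "trips over last 51 hrs" has to combine buckets of different grain. The sketch below shows one possible way to do it and is not the actual query planner: it assumes that whole calendar days inside the window are read from daily batch buckets and that the partial days at both edges are read from 5-minute real-time buckets keyed by bucket start time. WindowQuerySketch and countOverLastHours are hypothetical names.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: assembling a sliding-window count from two bucket grains.
final class WindowQuerySketch {
    /** Sums events in [now - hours, now). */
    static long countOverLastHours(LocalDateTime now, int hours,
                                   Map<LocalDate, Long> dailyBuckets,
                                   Map<LocalDateTime, Long> fiveMinBuckets) {
        LocalDateTime start = now.minusHours(hours);
        // First calendar day that lies completely inside the window.
        LocalDate firstFullDay = start.toLocalTime().equals(LocalTime.MIDNIGHT)
                ? start.toLocalDate()
                : start.toLocalDate().plusDays(1);
        LocalDate endDay = now.toLocalDate(); // the day containing 'now' is never complete
        long total = 0;
        // Whole days come from the daily rollups.
        for (LocalDate d = firstFullDay; d.isBefore(endDay); d = d.plusDays(1)) {
            total += dailyBuckets.getOrDefault(d, 0L);
        }
        // Partial days at both edges come from the 5-minute buckets.
        for (Map.Entry<LocalDateTime, Long> e : fiveMinBuckets.entrySet()) {
            LocalDateTime bucketStart = e.getKey();
            boolean inWindow = !bucketStart.isBefore(start) && bucketStart.isBefore(now);
            LocalDate day = bucketStart.toLocalDate();
            boolean coveredByDaily = !day.isBefore(firstFullDay) && day.isBefore(endDay);
            if (inWindow && !coveredByDaily) {
                total += e.getValue();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Map<LocalDate, Long> daily = new HashMap<>();
        daily.put(LocalDate.of(2018, 5, 9), 10L);
        Map<LocalDateTime, Long> rt = new HashMap<>();
        rt.put(LocalDateTime.of(2018, 5, 10, 5, 55), 3L);
        System.out.println(countOverLastHours(LocalDateTime.of(2018, 5, 10, 6, 0), 51, daily, rt));
    }
}
```

Because both grains hold summable values, the final answer is a plain sum over the selected buckets.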
Create Query (Spark SQL) ➔ Configure Spec ➔ Test Spec ➔ Commit to Prod
Self-service onboarding
Backfill Support: what is the value of a feature f1 for
an entity E1 from T_hist to T_now?
• Bootstrap to historic point in time: T_hist
• Incrementally compute from T_hist to T_now
How?
• Lifetime: for feature f1 on T_hist, access partition T_hist
• Windowed: for feature f2 on T_hist with a window of N days,
merge partitions T_hist-N to T_hist
Diagram: timeline T-120, T-119, ….., T-90, ….., T-1, T, showing the lifetime value on a given date and the last 30 day trips at T-90 and at T-89
Machine learning support
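The two backfill lookups can be sketched over date-partitioned values. This is an illustration under stated assumptions, not the platform's code: BackfillSketch, lifetimeAt and windowedAt are hypothetical names, partitions are modelled as in-memory maps, and the N-day window is taken to end at the historic date inclusive (the slide leaves the exact endpoints open).

```java
import java.time.LocalDate;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: point-in-time feature values from daily partitions.
final class BackfillSketch {
    /** Lifetime feature on tHist: read the lifetime snapshot partition for that date. */
    static long lifetimeAt(Map<LocalDate, Long> lifetimePartitions, LocalDate tHist) {
        return lifetimePartitions.getOrDefault(tHist, 0L);
    }

    /** Windowed feature on tHist: merge the daily partitions of the window ending at tHist. */
    static long windowedAt(Map<LocalDate, Long> dailyPartitions, LocalDate tHist, int windowDays) {
        long total = 0;
        for (int i = 0; i < windowDays; i++) {
            // Assumption: the window covers tHist and the windowDays - 1 days before it.
            total += dailyPartitions.getOrDefault(tHist.minusDays(i), 0L);
        }
        return total;
    }

    public static void main(String[] args) {
        Map<LocalDate, Long> daily = new HashMap<>();
        daily.put(LocalDate.of(2018, 4, 12), 3L);
        daily.put(LocalDate.of(2018, 4, 13), 4L);
        System.out.println(windowedAt(daily, LocalDate.of(2018, 4, 13), 30));
    }
}
```

Calling windowedAt for each date between the historic point and today would yield the feature's value series for training, without recomputing from raw events.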
● Use of Spark to achieve massive scale
● Combine with Streaming aggregation for freshness
● Low latency access in production (P99 <= 20ms) at high QPS
● Simplify onboarding via single spec, onboarding time in hours
● Huge computational cost improvements
Takeaways
Proprietary and confidential © 2018 Uber Technologies, Inc. All rights reserved. No part of this
document may be reproduced or utilized in any form or by any means, electronic or mechanical,
including photocopying, recording, or by any information storage or retrieval systems, without
permission in writing from Uber. This document is intended only for the use of the individual or entity
to whom it is addressed and contains information that is privileged, confidential or otherwise exempt
from disclosure under applicable law. All recipients of this document are notified that the information
contained herein includes proprietary and confidential information of Uber, and recipient may not
make use of, disseminate, or in any way disclose this document or any of the enclosed information
to any person other than employees of addressee to the extent necessary for consultations with
authorized personnel of Uber.
bhanotp@uber.com
anene@uber.com
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
Ad

Recently uploaded (20)

Process Mining at Dimension Data - Jan vermeulen
Process Mining at Dimension Data - Jan vermeulenProcess Mining at Dimension Data - Jan vermeulen
Process Mining at Dimension Data - Jan vermeulen
Process mining Evangelist
 
How to regulate and control your it-outsourcing provider with process mining
How to regulate and control your it-outsourcing provider with process miningHow to regulate and control your it-outsourcing provider with process mining
How to regulate and control your it-outsourcing provider with process mining
Process mining Evangelist
 
report (maam dona subject).pptxhsgwiswhs
report (maam dona subject).pptxhsgwiswhsreport (maam dona subject).pptxhsgwiswhs
report (maam dona subject).pptxhsgwiswhs
AngelPinedaTaguinod
 
Fundamentals of Data Analysis, its types, tools, algorithms
Fundamentals of Data Analysis, its types, tools, algorithmsFundamentals of Data Analysis, its types, tools, algorithms
Fundamentals of Data Analysis, its types, tools, algorithms
priyaiyerkbcsc
 
50_questions_full.pptxdddddddddddddddddd
50_questions_full.pptxdddddddddddddddddd50_questions_full.pptxdddddddddddddddddd
50_questions_full.pptxdddddddddddddddddd
emir73065
 
How to Set Up Process Mining in a Decentralized Organization?
How to Set Up Process Mining in a Decentralized Organization?How to Set Up Process Mining in a Decentralized Organization?
How to Set Up Process Mining in a Decentralized Organization?
Process mining Evangelist
 
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
bastakwyry
 
RAG Chatbot using AWS Bedrock and Streamlit Framework
RAG Chatbot using AWS Bedrock and Streamlit FrameworkRAG Chatbot using AWS Bedrock and Streamlit Framework
RAG Chatbot using AWS Bedrock and Streamlit Framework
apanneer
 
AWS RDS Presentation to make concepts easy.pptx
AWS RDS Presentation to make concepts easy.pptxAWS RDS Presentation to make concepts easy.pptx
AWS RDS Presentation to make concepts easy.pptx
bharatkumarbhojwani
 
CS-404 COA COURSE FILE JAN JUN 2025.docx
CS-404 COA COURSE FILE JAN JUN 2025.docxCS-404 COA COURSE FILE JAN JUN 2025.docx
CS-404 COA COURSE FILE JAN JUN 2025.docx
nidarizvitit
 
Transforming health care with ai powered
Transforming health care with ai poweredTransforming health care with ai powered
Transforming health care with ai powered
gowthamarvj
 
Process Mining Machine Recoveries to Reduce Downtime
Process Mining Machine Recoveries to Reduce DowntimeProcess Mining Machine Recoveries to Reduce Downtime
Process Mining Machine Recoveries to Reduce Downtime
Process mining Evangelist
 
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
Taqyea
 
Voice Control robotic arm hggyghghgjgjhgjg
Voice Control robotic arm hggyghghgjgjhgjgVoice Control robotic arm hggyghghgjgjhgjg
Voice Control robotic arm hggyghghgjgjhgjg
4mg22ec401
 
real illuminati Uganda agent 0782561496/0756664682
real illuminati Uganda agent 0782561496/0756664682real illuminati Uganda agent 0782561496/0756664682
real illuminati Uganda agent 0782561496/0756664682
way to join real illuminati Agent In Kampala Call/WhatsApp+256782561496/0756664682
 
Lagos School of Programming Final Project Updated.pdf
Lagos School of Programming Final Project Updated.pdfLagos School of Programming Final Project Updated.pdf
Lagos School of Programming Final Project Updated.pdf
benuju2016
 
Chapter 6-3 Introducingthe Concepts .pptx
Chapter 6-3 Introducingthe Concepts .pptxChapter 6-3 Introducingthe Concepts .pptx
Chapter 6-3 Introducingthe Concepts .pptx
PermissionTafadzwaCh
 
AI ------------------------------ W1L2.pptx
AI ------------------------------ W1L2.pptxAI ------------------------------ W1L2.pptx
AI ------------------------------ W1L2.pptx
AyeshaJalil6
 
Agricultural_regionalisation_in_India(Final).pptx
Agricultural_regionalisation_in_India(Final).pptxAgricultural_regionalisation_in_India(Final).pptx
Agricultural_regionalisation_in_India(Final).pptx
mostafaahammed38
 
Feature Engineering for Electronic Health Record Systems
Feature Engineering for Electronic Health Record SystemsFeature Engineering for Electronic Health Record Systems
Feature Engineering for Electronic Health Record Systems
Process mining Evangelist
 
Process Mining at Dimension Data - Jan vermeulen
Process Mining at Dimension Data - Jan vermeulenProcess Mining at Dimension Data - Jan vermeulen
Process Mining at Dimension Data - Jan vermeulen
Process mining Evangelist
 
How to regulate and control your it-outsourcing provider with process mining
How to regulate and control your it-outsourcing provider with process miningHow to regulate and control your it-outsourcing provider with process mining
How to regulate and control your it-outsourcing provider with process mining
Process mining Evangelist
 
report (maam dona subject).pptxhsgwiswhs
report (maam dona subject).pptxhsgwiswhsreport (maam dona subject).pptxhsgwiswhs
report (maam dona subject).pptxhsgwiswhs
AngelPinedaTaguinod
 
Fundamentals of Data Analysis, its types, tools, algorithms
Fundamentals of Data Analysis, its types, tools, algorithmsFundamentals of Data Analysis, its types, tools, algorithms
Fundamentals of Data Analysis, its types, tools, algorithms
priyaiyerkbcsc
 
50_questions_full.pptxdddddddddddddddddd
50_questions_full.pptxdddddddddddddddddd50_questions_full.pptxdddddddddddddddddd
50_questions_full.pptxdddddddddddddddddd
emir73065
 
How to Set Up Process Mining in a Decentralized Organization?
How to Set Up Process Mining in a Decentralized Organization?How to Set Up Process Mining in a Decentralized Organization?
How to Set Up Process Mining in a Decentralized Organization?
Process mining Evangelist
 
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
bastakwyry
 
RAG Chatbot using AWS Bedrock and Streamlit Framework
RAG Chatbot using AWS Bedrock and Streamlit FrameworkRAG Chatbot using AWS Bedrock and Streamlit Framework
RAG Chatbot using AWS Bedrock and Streamlit Framework
apanneer
 
AWS RDS Presentation to make concepts easy.pptx
AWS RDS Presentation to make concepts easy.pptxAWS RDS Presentation to make concepts easy.pptx
AWS RDS Presentation to make concepts easy.pptx
bharatkumarbhojwani
 
CS-404 COA COURSE FILE JAN JUN 2025.docx
CS-404 COA COURSE FILE JAN JUN 2025.docxCS-404 COA COURSE FILE JAN JUN 2025.docx
CS-404 COA COURSE FILE JAN JUN 2025.docx
nidarizvitit
 
Transforming health care with ai powered
Transforming health care with ai poweredTransforming health care with ai powered
Transforming health care with ai powered
gowthamarvj
 
Process Mining Machine Recoveries to Reduce Downtime
Process Mining Machine Recoveries to Reduce DowntimeProcess Mining Machine Recoveries to Reduce Downtime
Process Mining Machine Recoveries to Reduce Downtime
Process mining Evangelist
 
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
Taqyea
 
Voice Control robotic arm hggyghghgjgjhgjg
Voice Control robotic arm hggyghghgjgjhgjgVoice Control robotic arm hggyghghgjgjhgjg
Voice Control robotic arm hggyghghgjgjhgjg
4mg22ec401
 
Lagos School of Programming Final Project Updated.pdf
Lagos School of Programming Final Project Updated.pdfLagos School of Programming Final Project Updated.pdf
Lagos School of Programming Final Project Updated.pdf
benuju2016
 
Chapter 6-3 Introducingthe Concepts .pptx
Chapter 6-3 Introducingthe Concepts .pptxChapter 6-3 Introducingthe Concepts .pptx
Chapter 6-3 Introducingthe Concepts .pptx
PermissionTafadzwaCh
 
AI ------------------------------ W1L2.pptx
AI ------------------------------ W1L2.pptxAI ------------------------------ W1L2.pptx
AI ------------------------------ W1L2.pptx
AyeshaJalil6
 
Agricultural_regionalisation_in_India(Final).pptx
Agricultural_regionalisation_in_India(Final).pptxAgricultural_regionalisation_in_India(Final).pptx
Agricultural_regionalisation_in_India(Final).pptx
mostafaahammed38
 
Feature Engineering for Electronic Health Record Systems
Feature Engineering for Electronic Health Record SystemsFeature Engineering for Electronic Health Record Systems
Feature Engineering for Electronic Health Record Systems
Process mining Evangelist
 

Large Scale Feature Aggregation Using Apache Spark with Pulkit Bhanot and Amit Nene

  • 1. Large-scale Feature Aggregation Using Apache Spark. Pulkit Bhanot, Amit Nene (Risk Platform). #Dev1SAIS
  • 2. Agenda: Motivation; Challenges; Architecture Deep Dive; Role of Spark; Takeaways
  • 3. Team Mission: build a scalable, self-service Feature Engineering Platform for predictive decisioning (based on ML and Business Rules)
      - Feature Engineering: use of domain knowledge to create Features
      - Self-Service for Data Scientists and Analysts without reliance on Engineers
      - Generic Platform: consolidating work towards wider ML efforts at Uber
  • 4. Sample use case: Fraud detection (Payments, Promotions). Detect and prevent bad actors in real time, using features such as:
      - number of trips over X hours/weeks
      - trips cancelled over Y months
      - count of referrals over lifetime
      - ...
  • 5. Time Series Aggregations
      Needs:
      - Lifetime of entity
      - Sliding long-window: days/weeks/months
      - Sliding short-window: mins/hours
      - Real-time
      Existing solutions (warehouse, indexed databases, streaming aggregations):
      - None satisfies all of the above; none fits the bill
      - Complex onboarding
  • 6. Technical Challenges
      - Scale: 1000s of aggregations for 100s of millions of business entities
      - Long-window aggregation queries are slow even with indexes (seconds); millisecond latency at high QPS is needed
      - Onboarding complexity: many moving parts
      - Recovery from accrual of errors
  • 7. Our approach
      One-stop shop for aggregation:
      - Single system to interact with
      - Single spec: automate configurations of underlying system
      Scalable:
      - Leverage the scale of the Batch system for long windows
      - Combine with real-time aggregation for freshness
      - Rollups: aggregate over time intervals
      - Fast query over rolled-up aggregates
      - Caveat: summable functions
      Self-healing:
      - Batch system auto-corrects any errors
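The rollup idea on slide 7 can be sketched in plain Java, without Spark. This is a minimal illustration, not the platform's code: the `Event` shape, the 1-day grain, and the class and method names are all assumptions made for the example. It shows why the slide restricts rollups to summable functions: a window query only has to add a handful of pre-aggregated buckets, and the result equals the sum over the raw events.

```java
import java.time.LocalDate;
import java.util.List;
import java.util.TreeMap;

final class RollupSketch {
    /** A raw event: the day it occurred and the value to aggregate (hypothetical shape). */
    static final class Event {
        final LocalDate day;
        final long value;
        Event(LocalDate day, long value) { this.day = day; this.value = value; }
    }

    /** Roll raw events up into one partial sum per day (a 1-day grain). */
    static TreeMap<LocalDate, Long> rollupByDay(List<Event> events) {
        TreeMap<LocalDate, Long> buckets = new TreeMap<>();
        for (Event e : events) {
            buckets.merge(e.day, e.value, Long::sum);
        }
        return buckets;
    }

    /** Answer a window query by adding the rolled-up buckets in [from, to], both inclusive. */
    static long windowSum(TreeMap<LocalDate, Long> buckets, LocalDate from, LocalDate to) {
        long total = 0;
        for (long partial : buckets.subMap(from, true, to, true).values()) {
            total += partial;
        }
        return total;
    }
}
```

The same structure works for any function whose partial results can be combined by addition (sum, count); functions such as median or distinct count would need a different bucket representation.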
  • 8. Aggregator as a black box
      Input parameters to the black box:
      - Source events (raw events: Streaming + Hive)
      - Grain size: 5 min (realtime), 1 day (offline), etc.
      - Aggregation functions: sum, count, etc.
      - Aggregation windows: LTD, 7 days, 30 days, etc.
      Output: Aggregated Features
  • 9. Overall architecture
      Components: Specs, Batch Aggregator (Spark Apps), Real-time Aggregator (Streaming), Feature Store, Feature Access (Microservice)
      - Batch (Spark): long-window (weeks, months); bootstrap and incremental modes
      - Streaming (e.g. Kafka events): short-window (<24 hrs); near-real time
      - Real-time Access: merge offline and streaming
      - Feature Store: save rolled-up aggregates in Hive and Cassandra
  • 10. Batch Spark Engine
      [Diagram. Labels: Hive (Tbl1:<2018-04-10>, Tb1:<2018-04-09>); Specs; Batch Aggregator (Spark Apps); Feature Extractor; Scheduler; Rollup Generator; Bootstrap; Periodic Snapshot Manager; Optimizer; Computation; Full Snapshot; Incremental Snapshot; Optimized Snapshot; Dispersal; Feature Store (Cassandra); Feature Access; Decisioning System]
  • 11. Batch Storage (Hive)
      Daily partitioned source tables:
      - Table-1: partitions <2018-04-10> to <2018-04-13>; columns col1, col2, col3, col4
      Rolled-up tables:
      - ATable-1_daily (daily incremental rollup; features involving sliding window computation): partitions <2018-04-10> to <2018-04-13>; columns uuid, f1, f2
      - ATable-1_Lifetime (daily lifetime snapshot; features involving lifetime computation): partitions <2018-04-10> to <2018-04-13>; columns uuid, f1_ltd, f2_ltd
      - ATable-1_joined (dispersed to real-time store): partitions <2018-04-10> to <2018-04-13>; columns uuid, f1, f2, f1_ltd, f2_ltd
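One way to read the daily lifetime snapshot on slide 11 is that each day's lifetime-to-date (LTD) partition is derived from the previous day's partition plus that day's rollup. The slides do not spell this out, so the following plain-Java sketch is an assumption; `LifetimeSnapshot` and `advance` are hypothetical names, and in-memory maps stand in for the Hive partitions.

```java
import java.util.HashMap;
import java.util.Map;

final class LifetimeSnapshot {
    /**
     * Build today's lifetime-to-date (LTD) snapshot from yesterday's snapshot and
     * today's daily rollup. Entities present on only one side are kept, which is
     * the effect of an outer join on the entity key.
     */
    static Map<String, Long> advance(Map<String, Long> previousLtd, Map<String, Long> dailyRollup) {
        Map<String, Long> next = new HashMap<>(previousLtd);
        for (Map.Entry<String, Long> e : dailyRollup.entrySet()) {
            next.merge(e.getKey(), e.getValue(), Long::sum);
        }
        return next;
    }
}
```

Because the previous snapshot is copied rather than mutated, every day's partition stays available, which matters for the point-in-time lookups described later in the deck.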
  • 12. Role of Spark
      - Orchestrator of ETL pipelines: scheduling of subtasks; recording incremental progress
      - Optimally resizes HDFS files: scales with the size of the data set
      - Rich set of APIs to enable complex optimizations, e.g. an optimization in bootstrap dispersal:

          // Outer-join the daily rollups with the lifetime-to-date (LTD) data on the entity key
          dailyDataset.join(
              ltdData,
              JavaConverters.asScalaIteratorConverter(
                      Arrays.asList(pipelineConfig.getEntityKey()).iterator())
                  .asScala()
                  .toSeq(),
              "outer");

      Sample joined row:
          uuid:          44b7dc88
          _ltd:          1534
          daily_buckets: [{"2017-10-24":"4"},{"2017-08-22":"3"},{"2017-09-21":"4"},{"2017-08-08":"3"},{"2017-10-03":"3"},{"2017-10-19":"5"},{"2017-09-06":"1"},{"2017-08-17":"5"},{"2017-09-09":"12"},{"2017-10-05":"5"},{"2017-09-25":"4"},{"2017-09-17":"13"}]
  • 13. Role of Spark (continued)
      - Ability to disperse billions of records: HashPartitioner to the rescue

          // Partition the data by hash
          HashPartitioner hashPartitioner = new HashPartitioner(partitionNumber);
          JavaPairRDD<String, Row> hashedRDD = keyedRDD.partitionBy(hashPartitioner);

          // Fetch each hash partition and process it (pseudocode loop)
          foreach partition {
              JavaRDD<Tuple2<String, Row>> filteredHashRDD = filterRows(hashedRDD, index, partitionId);
              // raise error if partition mismatch
              Dataset<Row> filteredDataSet = etlContext.getSparkSession().createDataset(
                  filteredHashRDD.map(tuple -> tuple._2()).rdd(),
                  data.org$apache$spark$sql$Dataset$$encoder);
              // repartition filteredDataSet; update checkpoint and records processed
              // after successful completion
          }

      [Diagram: Parent RDD split into partitions P1, P2, P3, ..., Pn; process each partition]
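The partition-by-partition dispersal on slide 13 can be illustrated without a Spark cluster. The sketch below is an assumption-laden stand-in: `PartitionedDispersal`, `disperse`, the `completed` checkpoint set, and the `sink` callback are hypothetical names, and an in-memory list replaces the RDD. It keeps two ideas from the slide: keys are assigned to partitions by a non-negative modulus of their hash code (which is how Spark's `HashPartitioner` assigns them), and a partition is checkpointed only after it has been written successfully.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

final class PartitionedDispersal {
    /** Non-negative hash partition id for a key. */
    static int partitionOf(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    /**
     * Disperse keys one hash partition at a time. Partitions listed in `completed`
     * are skipped, and each partition id is added to it only after its sink call
     * returns, so a failed run can resume from the last checkpoint.
     */
    static void disperse(List<String> keys, int numPartitions,
                         Set<Integer> completed, Consumer<List<String>> sink) {
        for (int p = 0; p < numPartitions; p++) {
            if (completed.contains(p)) {
                continue; // already dispersed in an earlier run
            }
            List<String> batch = new ArrayList<>();
            for (String key : keys) {
                if (partitionOf(key, numPartitions) == p) {
                    batch.add(key);
                }
            }
            sink.accept(batch); // hypothetical write to the feature store
            completed.add(p);   // checkpoint only after a successful write
        }
    }
}
```

If `sink` throws for some partition, that partition is not recorded as completed, so a rerun with the same `completed` set retries it and skips everything already written.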
  • 14. Role of Spark in Dispersal
      - Global throttling: the Feature Store can be the bottleneck; use coalesce() to limit the executors
      - Inspect data: disperse only if any column has changed
      - Monitoring and alerting: create custom metrics
      [Diagram: Bootstrap produces full computation snapshots (2018-02-01, 2018-02-02, 2018-02-03, ..., 2018-03-01), which are reduced to optimized snapshots for the same dates and dispersed to C*]
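The "disperse only if any column has changed" rule on slide 14 amounts to diffing consecutive snapshots. A minimal plain-Java sketch follows; representing a row as a list of column values and the names `SnapshotDiff` and `changedRows` are assumptions made for illustration, not the deck's implementation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SnapshotDiff {
    /**
     * Return only the rows of `current` that are new or whose column values differ
     * from `previous`; unchanged rows need not be written to the feature store again.
     */
    static Map<String, List<Long>> changedRows(Map<String, List<Long>> previous,
                                               Map<String, List<Long>> current) {
        Map<String, List<Long>> changed = new HashMap<>();
        for (Map.Entry<String, List<Long>> row : current.entrySet()) {
            List<Long> before = previous.get(row.getKey());
            // A missing previous row (null) never equals the current one, so new rows are kept.
            if (!row.getValue().equals(before)) {
                changed.put(row.getKey(), row.getValue());
            }
        }
        return changed;
    }
}
```

Filtering this way reduces write volume against the store, which complements the throttling point on the same slide.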
  • 15. Real-time streaming engine
      - Real-time, summable aggregations for < 24 hours
      - Semantically equivalent to offline computation
      - Aggregation rollups (5 mins) maintained in the feature store (Cassandra)
      [Diagram. Labels: raw kafka events; event enrichment; microservices; RPCs; streaming computation pipelines (xform_0, xform_1, xform_2); time window aggregator; C*; Uber Athena streaming]
  • 16. Final aggregation in real time
      - Uses time series and clustering key support in Cassandra
          - 1 table for Lifetime & LTD values
          - Multiple tables for realtime values with grain size 5M
      - Consult metadata and assemble into a single result at feature access time
      Tables:
      - entity_common_aggr_bt_ltd: UUID (PK), trip_count_ltd
      - entity_common_aggr_bt: UUID (PK), eventbucket (CK), trip_count
      - entity_common_aggr_rt_2018_05_08, entity_common_aggr_rt_2018_05_09, entity_common_aggr_rt_2018_05_10: UUID (PK), eventbucket (CK), trip_count
      Components: Feature Access Service, Metadata Service, Query Planner
      Example queries:
      - lifetime trip count
      - trips over last 51 hrs
      - trips over previous 2 days
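The merge step on slide 16 can be sketched for the simplest of the example queries, the lifetime trip count. The slides do not describe the query planner's logic, so this is an assumption: the batch LTD value is taken to be complete up to some cutoff instant, and real-time buckets starting at or after that cutoff are added on top. `FeatureAccessSketch`, `lifetimeCount`, and `batchCutoff` are hypothetical names.

```java
import java.time.Instant;
import java.util.NavigableMap;

final class FeatureAccessSketch {
    /**
     * Lifetime count at query time = batch lifetime value as of `batchCutoff`
     * plus every real-time bucket that starts at or after the cutoff.
     */
    static long lifetimeCount(long batchLtd, Instant batchCutoff,
                              NavigableMap<Instant, Long> realtimeBuckets) {
        long total = batchLtd;
        for (long partial : realtimeBuckets.tailMap(batchCutoff, true).values()) {
            total += partial;
        }
        return total;
    }
}
```

Buckets that start before the cutoff are ignored here because, under the stated assumption, the batch value already accounts for them; counting them again would double-count events.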
  • 17. Self-service onboarding: Create Query (Spark SQL); Configure Spec; Test Spec; Commit to Prod
  • 18. Machine learning support
      Backfill support: what is the value of a feature f1 for an entity E1 from T_hist to T_now?
      - Bootstrap to a historic point in time: T_hist
      - Incrementally compute from T_hist to T_now
      How?
      - Lifetime: for feature f1 on T_hist, access partition T_hist
      - Windowed: for feature f2 on T_hist with a window of N days, merge partitions T_hist-N to T_hist
      [Diagram: timeline T-120, T-119, ..., T-90, ..., T-1, T; lifetime value on a given date; last 30 day trips at T-90; last 30 day trips at T-89]
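The windowed backfill on slide 18 can be sketched as a bootstrap followed by incremental updates. This is an illustration under assumptions: the window is taken to be the N days ending on the target day (inclusive), missing partitions count as zero, and `BackfillSketch` and `windowedSeries` are hypothetical names.

```java
import java.time.LocalDate;
import java.util.LinkedHashMap;
import java.util.Map;

final class BackfillSketch {
    /**
     * For each day in [from, to], compute the sum of the daily partitions in the
     * N-day window ending on that day (inclusive). The window is slid forward by
     * adding the newest partition and dropping the one that falls out.
     */
    static Map<LocalDate, Long> windowedSeries(Map<LocalDate, Long> daily,
                                               LocalDate from, LocalDate to, int windowDays) {
        Map<LocalDate, Long> series = new LinkedHashMap<>();
        long sum = 0;
        // Bootstrap: the first window is computed in full.
        for (int i = 0; i < windowDays; i++) {
            sum += daily.getOrDefault(from.minusDays(i), 0L);
        }
        series.put(from, sum);
        // Incremental: each later day reuses the previous window's sum.
        for (LocalDate d = from.plusDays(1); !d.isAfter(to); d = d.plusDays(1)) {
            sum += daily.getOrDefault(d, 0L);
            sum -= daily.getOrDefault(d.minusDays(windowDays), 0L);
            series.put(d, sum);
        }
        return series;
    }
}
```

The subtraction step is only valid for summable functions, the same caveat the deck raises for rollups in general.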
  • 19. Takeaways
      - Use of Spark to achieve massive scale
      - Combine with Streaming aggregation for freshness
      - Low latency access in production (P99 <= 20ms) at high QPS
      - Simplify onboarding via single spec; onboarding time in hours
      - Huge computational cost improvements
  • 20. Proprietary and confidential © 2018 Uber Technologies, Inc. All rights reserved. No part of this document may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or retrieval systems, without permission in writing from Uber. This document is intended only for the use of the individual or entity to whom it is addressed and contains information that is privileged, confidential or otherwise exempt from disclosure under applicable law. All recipients of this document are notified that the information contained herein includes proprietary and confidential information of Uber, and recipient may not make use of, disseminate, or in any way disclose this document or any of the enclosed information to any person other than employees of addressee to the extent necessary for consultations with authorized personnel of Uber. bhanotp@uber.com anene@uber.com