The future is open
@ItaiYaffe, @ettigur
Digital advertising - a multi-billion dollar industry
● In 2019, internet advertising spending worldwide was over
$290B (statista.com, June 2020)
● Apple spent over $110M on iPhone & TV+ advertising during
September and October 2019 (9To5Mac.com, November 2019)
● GM is shifting ‘SIGNIFICANT’ dollars to connected TV advertising
(AdAge.com, January 2020)
$$$ spent each year
on digital advertising campaigns
What does a funnel look like?
AD EXPOSURE: 100M
  85M drop-off
HOMEPAGE: 15M
  5M drop-off
PRODUCT PAGE: 10M
So everybody wants to measure their campaigns’ efficiency!
But how???
Funnel Analysis with
Apache Spark and Druid
Etti Gur, Nielsen
Itai Yaffe, Imply
Introduction
Etti Gur
● Senior Big Data Engineer @ Nielsen
● Building data pipelines using Spark,
Kafka, Druid, Airflow and more
Etti Gur @ettigur
Itai Yaffe
● Principal Solutions Architect @ Imply
○ Previously Big Data Tech Lead @ Nielsen
● Dealing with Big Data challenges since 2012
Itai Yaffe @ItaiYaffe
Nielsen Identity
● Data and Measurement company
● Media consumption
● Single source of truth of individuals and households
○ Unifies many proprietary datasets
○ Generates holistic view of a consumer
Nielsen Identity in numbers
● >10B events/day
● 60TB/day (S3)
● 6000 nodes/day
● 10’s of TB ingested/day (Druid)
The challenges
● Scalability
● Cost Efficiency
● Fault-tolerance
What will you learn?
How to overcome the technical challenges of Funnel Analysis
using Apache Spark, Druid and DataSketches,
and why you should even care
Campaign phases - user’s point-of-view
● Awareness - exposed to campaign (e.g. via online ad)
● Consideration - interest is expressed (e.g. clicked ad)
● Intent - steps taken towards making a purchase (e.g. added product to cart)
● Purchase
Tactic Stages
Campaign phases - campaign owner’s point-of-view
Awareness -> drop-off -> Consideration -> drop-off -> Intent -> drop-off -> Purchase
Campaign phases - why is it called “a funnel”?
AD EXPOSURE: 100M UUs
  85M drop-off
HOMEPAGE: 15M UUs
  5M drop-off
PRODUCT PAGE: 10M UUs
  7M drop-off
CHECKOUT: 3M UUs
* UUs = Unique Users
We need to analyze the funnel, hence:
“Funnel Analysis”
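The arithmetic behind the funnel can be sketched in a few lines of Python. This is only an illustration using the unique-user counts from the slide; the `funnel_report` helper and its field names are hypothetical and not part of the talk.

```python
from typing import Dict, List, Tuple


def funnel_report(stages: List[Tuple[str, int]]) -> List[Dict[str, object]]:
    """Compute drop-off and stage-to-stage conversion for an ordered funnel."""
    report = []
    previous_users = None
    for name, unique_users in stages:
        if previous_users is None:
            # The first stage has no previous stage to compare against.
            drop_off, conversion = 0, 1.0
        else:
            drop_off = previous_users - unique_users
            conversion = unique_users / previous_users
        report.append(
            {"stage": name, "unique_users": unique_users,
             "drop_off": drop_off, "conversion": conversion}
        )
        previous_users = unique_users
    return report


# Unique-user counts taken from the funnel slide.
stages = [
    ("AD EXPOSURE", 100_000_000),
    ("HOMEPAGE", 15_000_000),
    ("PRODUCT PAGE", 10_000_000),
    ("CHECKOUT", 3_000_000),
]
report = funnel_report(stages)
for row in report:
    print(f"{row['stage']:<13} {row['unique_users']:>11,} UUs  "
          f"drop-off {row['drop_off']:>10,}  conversion {row['conversion']:.0%}")
```

Each drop-off is simply the difference between two consecutive stages, which is why the counts must be unique users and not raw views.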
Views vs Unique Users
2 Unique Users
7 Views
2 Purchases $$$ $$$
But how can one measure campaign efficiency?
● Collect a huge stream of events (i.e. user activities)
while the campaign is live
● Map events to funnel stages
○ E.g. ad exposure = tactic
● Provide insights quickly
So… what’s wrong with off-the-shelf alternatives?
Topic                      Off-the-shelf alternatives
Scalability                Limited
Access to raw data         Lack access
Count-distinct operations  Very slow
* Based on tinyurl.com/qqza5ur
Introducing: Apache Druid
Why is it cool?
● Store trillions of events, petabytes of data
● Sub-second analytic queries
● Highly scalable
● Cost effective
● Decoupled architecture
○ E.g. ingestion is separated from query
Roll-up - Simple Count (Views)
Raw events:
Timestamp   Website    Device ID
2021-05-26  www.a.com  3a4c1f2d84a5c179435c1fea86e6ae02
2021-05-26  www.a.com  3a4c1f2d84a5c179435c1fea86e6ae02
2021-05-26  www.a.com  5dd59f9bd068f802a7c6dd832bf60d02
2021-05-26  www.b.com  5dd59f9bd068f802a7c6dd832bf60d02
2021-05-26  www.c.com  5dd59f9bd068f802a7c6dd832bf60d02
After roll-up (LongSumAggregator):
Timestamp   Website    Views
2021-05-26  www.a.com  3
2021-05-26  www.b.com  1
2021-05-26  www.c.com  1
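As a rough illustration of what this roll-up does, the sketch below groups the five raw events from the slide by (timestamp, website) and counts them. It uses plain Python as a stand-in for Druid's ingestion-time aggregation and is not Druid code.

```python
from collections import Counter

# Raw events from the slide: (timestamp, website, device ID).
events = [
    ("2021-05-26", "www.a.com", "3a4c1f2d84a5c179435c1fea86e6ae02"),
    ("2021-05-26", "www.a.com", "3a4c1f2d84a5c179435c1fea86e6ae02"),
    ("2021-05-26", "www.a.com", "5dd59f9bd068f802a7c6dd832bf60d02"),
    ("2021-05-26", "www.b.com", "5dd59f9bd068f802a7c6dd832bf60d02"),
    ("2021-05-26", "www.c.com", "5dd59f9bd068f802a7c6dd832bf60d02"),
]

# Roll-up: one row per (timestamp, website); the device ID is dropped
# and each raw event contributes 1 to the summed "views" metric.
views = Counter((timestamp, website) for timestamp, website, _ in events)

for (timestamp, website), count in sorted(views.items()):
    print(timestamp, website, count)
```

Five raw rows collapse into three stored rows, at the cost of no longer knowing which devices produced the views.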
Druid architecture
Powered by Druid
Common use-cases for Druid
● Clickstream analytics
○ Funnel analysis
● Network performance monitoring
● Application performance management
● Supply chain analytics
○ Manufacturing (IoT and device) metrics
● BI and OLAP
● And more...
Druid in a nutshell
● A real-time analytics database
○ Time-series, columnar
● Can ingest and store trillions of events, and serve analytic queries with
sub-second latency
● Highly scalable, cost-effective
● Widely used among Big Data companies for:
○ Application performance management
○ Clickstream analytics and funnel analysis
○ And more
Why is Druid suitable for the task?
Topic                      Off-the-shelf alternatives  Druid
Scalability                Limited                     Highly scalable
Access to raw data         Lack access                 Can store trillions of events
Count-distinct operations  Very slow                   Sub-second approximate count distinct with set operations, using the Theta Sketch module
Theta Sketch???
* Based on tinyurl.com/qqza5ur
What is Theta Sketch?
● Theta Sketch mathematical framework - a generalization of K Minimum Values (KMV)
● Estimates set cardinality
● Supports set-theoretic operations
Theta Sketch error - error as a function of K
K        1 Std Dev (68.27% confidence)  2 Std Dev (95.45% confidence)
16,384   0.78%                          1.56%
32,768   0.55%                          1.10%
65,536   0.39%                          0.78%
* Larger K = more memory & storage needed
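To make the KMV idea concrete, here is a minimal pure-Python sketch. It is not the Apache DataSketches implementation: the SHA-256 hashing, the `(k - 1) / kth_smallest` estimator and the `1 / sqrt(K)` error approximation are assumptions chosen for illustration, although the approximation does reproduce the percentages in the table.

```python
import hashlib
import heapq
import math


def hash_to_unit_interval(item: str) -> float:
    """Map an item to a pseudo-random number in [0, 1)."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_estimate(items, k: int) -> float:
    """Estimate the number of distinct items while keeping only k hash values."""
    retained = set()  # the k smallest hash values seen so far
    heap = []         # max-heap (values negated) over the same values
    for item in items:
        h = hash_to_unit_interval(item)
        if h in retained:
            continue  # duplicate of an item that is already retained
        if len(heap) < k:
            heapq.heappush(heap, -h)
            retained.add(h)
        elif h < -heap[0]:
            # Replace the current largest retained value with the smaller one.
            evicted = -heapq.heappushpop(heap, -h)
            retained.discard(evicted)
            retained.add(h)
    if len(heap) < k:
        return float(len(heap))  # fewer than k distinct items: count is exact
    kth_smallest = -heap[0]
    return (k - 1) / kth_smallest


def relative_standard_error(k: int) -> float:
    """Approximate error for one standard deviation, as a fraction."""
    return 1.0 / math.sqrt(k)


users = [f"user-{i}" for i in range(100_000)]
estimate = kmv_estimate(users + users, k=1024)  # duplicates do not affect the estimate
print(round(estimate))
for k in (16_384, 32_768, 65_536):
    rse = relative_standard_error(k)
    print(f"K={k:,}: {rse:.2%} (1 std dev), {2 * rse:.2%} (2 std dev)")
```

The sketch stores at most K hash values regardless of how many events it sees, which is the memory/accuracy trade-off noted under the table.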
Theta Sketch demo
tinyurl.com/ugk6p67
The Theta Sketch module in Druid
● Part of the Apache DataSketches library (datasketches.apache.org)
● At ingestion time
○ Sketches are created and stored in Druid segments
● At query time
○ Sketches are aggregated (i.e. union, intersection or difference
between sketches)
○ The result - estimated number of unique entries in the aggregated
sketch
● Also see this short video - tinyurl.com/vdwojh6
Roll-up - Count Distinct (Unique Users)
Raw events:
Timestamp   Website    Device ID
2021-05-26  www.a.com  3a4c1f2d84a5c179435c1fea86e6ae02
2021-05-26  www.a.com  3a4c1f2d84a5c179435c1fea86e6ae02
2021-05-26  www.a.com  5dd59f9bd068f802a7c6dd832bf60d02
2021-05-26  www.b.com  5dd59f9bd068f802a7c6dd832bf60d02
2021-05-26  www.c.com  5dd59f9bd068f802a7c6dd832bf60d02
After roll-up (ThetaSketchAggregator):
Timestamp   Website    Unique Users*
2021-05-26  www.a.com  2*
2021-05-26  www.b.com  1*
2021-05-26  www.c.com  1*
* What is actually stored is a ThetaSketch object.
The actual result is calculated in real-time, which allows us
to do UNIONs and INTERSECTIONs
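Why store a sketch per row instead of a number? The sketch below uses exact Python sets as a stand-in for Theta Sketch objects (an assumption for illustration; real sketches are approximate and bounded in size) to show that per-row unique counts cannot simply be added up, while set-like objects can still be unioned or intersected at query time.

```python
from collections import defaultdict

# Raw events from the slide: (timestamp, website, device ID).
events = [
    ("2021-05-26", "www.a.com", "3a4c1f2d84a5c179435c1fea86e6ae02"),
    ("2021-05-26", "www.a.com", "3a4c1f2d84a5c179435c1fea86e6ae02"),
    ("2021-05-26", "www.a.com", "5dd59f9bd068f802a7c6dd832bf60d02"),
    ("2021-05-26", "www.b.com", "5dd59f9bd068f802a7c6dd832bf60d02"),
    ("2021-05-26", "www.c.com", "5dd59f9bd068f802a7c6dd832bf60d02"),
]

# Roll-up: one set of device IDs per (timestamp, website) row.
rollup = defaultdict(set)
for timestamp, website, device_id in events:
    rollup[(timestamp, website)].add(device_id)

unique_per_site = {key: len(devices) for key, devices in rollup.items()}

# Query time: combine the stored objects, not the numbers.
all_sites = set().union(*rollup.values())
a_and_b = rollup[("2021-05-26", "www.a.com")] & rollup[("2021-05-26", "www.b.com")]

print(unique_per_site)
print("sum of per-site counts:", sum(unique_per_site.values()))
print("unique users across all sites:", len(all_sites))
print("unique users on both www.a.com and www.b.com:", len(a_and_b))
```

Adding the stored numbers would count the same device several times; combining the stored objects does not.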
Cool, so… Back to funnel analysis?
Funnel analysis - simple use-case
How many unique users viewed online ad?
VS
How many unique users viewed
online ad AND viewed product X page?
[Screenshots: simple use-case, date range 5/1/2021 - 5/26/2021]
Funnel analysis pipeline - high-level architecture
1. Mart Generator: read files of last day from the Data Lake
2. Mart Generator: write files by campaign,date to Campaigns' marts
3. Enricher: read files per campaign from Campaigns' marts
4. Enricher: write files by date,campaign to Enriched data
5. Load data by campaign (into Druid)
Funnel analysis pipeline - Data Lake
{event_time=2021-05-26T..., userid=uid1, attribute=online_ad}
{event_time=2021-05-26T..., userid=uid1, attribute=homepage}
{event_time=2021-05-26T..., userid=uid1, attribute=productX_page}
....
date=2021-05-24
date=2021-05-25
date=2021-05-26
Funnel analysis pipeline - Mart Generator
campaign=1210
campaign=1319
campaign=1472
date=2021-05-24
date=2021-05-25
date=2021-05-26
{event_time=2021-05-26T..., userid=uid1, attribute=online_ad, type=Tactic}
{event_time=2021-05-26T..., userid=uid1, attribute=homepage, type=Stage}
{event_time=2021-05-26T..., userid=uid1, attribute=productX_page, type=Stage}
....
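A minimal sketch of what the Mart Generator step might look like, in plain Python rather than Spark. The `CAMPAIGNS` mapping, the `generate_marts` helper and the concrete event times are hypothetical (the slide elides the times); only the record fields and the `campaign=.../date=...` partition layout come from the slides.

```python
from collections import defaultdict

# Hypothetical campaign definition: which attributes belong to the campaign,
# and whether each one is a Tactic or a Stage.
CAMPAIGNS = {
    1472: {"online_ad": "Tactic", "homepage": "Stage", "productX_page": "Stage"},
}


def generate_marts(events, campaigns):
    """Tag each event with its type and group it by campaign and date partition."""
    marts = defaultdict(list)
    for event in events:
        for campaign_id, attribute_types in campaigns.items():
            event_type = attribute_types.get(event["attribute"])
            if event_type is None:
                continue  # the event is not relevant to this campaign
            day = event["event_time"][:10]
            partition = f"campaign={campaign_id}/date={day}"
            marts[partition].append({**event, "type": event_type})
    return dict(marts)


# Illustrative events; the times are placeholders.
data_lake_events = [
    {"event_time": "2021-05-26T10:10", "userid": "uid1", "attribute": "online_ad"},
    {"event_time": "2021-05-26T10:11", "userid": "uid1", "attribute": "homepage"},
    {"event_time": "2021-05-26T10:12", "userid": "uid1", "attribute": "productX_page"},
    {"event_time": "2021-05-26T10:13", "userid": "uid2", "attribute": "unrelated_page"},
]
marts = generate_marts(data_lake_events, CAMPAIGNS)
for partition, rows in marts.items():
    print(partition, [(row["attribute"], row["type"]) for row in rows])
```

Partitioning by campaign first lets the next step read only the files of the campaign it is processing.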
Funnel analysis pipeline - Enricher
campaign=1210
campaign=1319
campaign=1472
date=2021-05-24
date=2021-05-25
date=2021-05-26
{event_date=2021-05-26, userid=uid1, tactic=online_ad, stage=homepage}
{event_date=2021-05-26, userid=uid1, tactic=online_ad, stage=productX_page}
....
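The enrichment shown above can be sketched as pairing every tactic a user was exposed to with every stage that user reached. This is a plain-Python stand-in for the Spark job; the `enrich` helper is hypothetical, and taking `event_date` from the stage event is an assumption, since the slides do not say which event supplies the date.

```python
from collections import defaultdict


def enrich(mart_events):
    """Pair each user's tactics with each of that user's stages (order ignored)."""
    tactics_by_user = defaultdict(set)
    stages_by_user = defaultdict(set)  # holds (event_date, stage) pairs
    for event in mart_events:
        if event["type"] == "Tactic":
            tactics_by_user[event["userid"]].add(event["attribute"])
        elif event["type"] == "Stage":
            event_date = event["event_time"][:10]  # assumption: date of the stage event
            stages_by_user[event["userid"]].add((event_date, event["attribute"]))

    rows = []
    for userid, stages in stages_by_user.items():
        for tactic in sorted(tactics_by_user.get(userid, ())):
            for event_date, stage in sorted(stages):
                rows.append({"event_date": event_date, "userid": userid,
                             "tactic": tactic, "stage": stage})
    return rows


# Illustrative mart events; the times are placeholders.
mart_events = [
    {"event_time": "2021-05-26T10:10", "userid": "uid1", "attribute": "online_ad", "type": "Tactic"},
    {"event_time": "2021-05-26T10:11", "userid": "uid1", "attribute": "homepage", "type": "Stage"},
    {"event_time": "2021-05-26T10:12", "userid": "uid1", "attribute": "productX_page", "type": "Stage"},
]
enriched = enrich(mart_events)
for row in enriched:
    print(row)
```

Each output row carries both `tactic` and `stage`, so they can later serve as dimensions next to the user ID.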
Funnel analysis pipeline - ingesting data into Druid
"type": "index_hadoop",
"spec": {
  "dataSchema": {
    "dataSource": "campaign_1472",
    "granularitySpec": {
      "queryGranularity": "day",
      "segmentGranularity": "day",
      "type": "uniform",
      "intervals": ["2021-05-01/2021-05-27"]
      ...
"timestampSpec": {
  "column": "event_date", "format": "yyyy-MM-dd"
},
"dimensionsSpec": {
  "dimensions": ["tactic", "stage"]
},
"metricsSpec": [{
  "fieldName": "userid", "type": "thetaSketch",
  "name": "user_id_sketch", "size": 65536}],
...
"inputSpec": {"type": "multi",
  "children": [
    {"type": "dataSource",
     "ingestionSpec": {
       "intervals": ["2021-05-01/2021-05-27"],
       "dataSource": "campaign_1472", ...}},
    {"type": "static",
     "paths": "s3://<BUCKET_NAME>/date=2021-05-26/campaign=1472",
     ...},
...
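The `multi` inputSpec above appears to combine the data already in the campaign's datasource with the newly enriched files for one day. A small helper for generating that fragment per campaign might look like the sketch below. The `build_multi_input_spec` function and its parameters are hypothetical, and it deliberately emits only the fields visible on the slide; the elided parts (`...`) are not filled in, so the result is not a complete ingestion spec.

```python
import json


def build_multi_input_spec(campaign_id: int, day: str, intervals: list, bucket: str) -> dict:
    """Build only the inputSpec fragment shown on the slide (elided fields omitted)."""
    datasource = f"campaign_{campaign_id}"
    return {
        "type": "multi",
        "children": [
            # Existing data of the campaign's datasource.
            {
                "type": "dataSource",
                "ingestionSpec": {"intervals": intervals, "dataSource": datasource},
            },
            # Newly enriched files for the given day.
            {
                "type": "static",
                "paths": f"s3://{bucket}/date={day}/campaign={campaign_id}",
            },
        ],
    }


input_spec = build_multi_input_spec(
    campaign_id=1472,
    day="2021-05-26",
    intervals=["2021-05-01/2021-05-27"],
    bucket="<BUCKET_NAME>",  # placeholder, as on the slide
)
print(json.dumps(input_spec, indent=2))
```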
Funnel analysis pipeline - Druid datasources
{__time=2021-05-26, tactic=online_ad, stage=homepage, user_id_sketch=<Object>}
{__time=2021-05-26, tactic=online_ad, stage=productX_page, user_id_sketch=<Object>}
....
campaign_1210
campaign_1319
campaign_1472
Funnel analysis pipeline - querying Druid (SQL)
SELECT
  APPROX_COUNT_DISTINCT_DS_THETA(user_id_sketch, 65536) AS homepage_sketch
FROM campaign_1472
WHERE "tactic" = 'online_ad'
  AND "stage" = 'homepage'
  AND __time BETWEEN '2021-05-01T00:00:00.000'
                 AND '2021-05-26T23:59:59.000'
* This specific query returns the estimated number of unique users
that viewed the online ad AND viewed the homepage
Funnel analysis - simple use-case revisited
[Screenshots: date range 5/1/2021 - 5/26/2021]
3,100 - 2,500 != 1000
ONLINE AD: 8.1M UUs
HOMEPAGE: 3.1K UUs
  2.5K drop-off
PRODUCT PAGE: 1K UUs
...
* UUs = Unique Users
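One scenario that is consistent with these numbers can be reproduced with plain sets. The split below (600 product-page viewers who also saw the homepage, 400 who did not) is an assumption made for illustration; the slides only give the three totals.

```python
# Users who viewed the homepage (3,100 unique users).
homepage = {f"h{i}" for i in range(3_100)}

# Assumed split of the 1,000 product-page viewers:
# 600 also viewed the homepage, 400 reached the product page some other way.
via_homepage = {f"h{i}" for i in range(600)}
not_via_homepage = {f"p{i}" for i in range(400)}
product_page = via_homepage | not_via_homepage

# Drop-off: homepage viewers who never viewed the product page.
drop_off = homepage - product_page

print("homepage UUs:    ", len(homepage))
print("drop-off:        ", len(drop_off))
print("product page UUs:", len(product_page))
print("homepage - drop-off:", len(homepage) - len(drop_off))
```

Without ordering, the product-page count includes users who never passed through the previous stage, so the stages do not add up like a funnel.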
Funnel analysis - complex use-case
How many unique users viewed online ad?
VS
How many unique users
viewed online ad FIRST and
THEN viewed product X page?
Funnel analysis - complex use-case
● This is what we call a sequential funnel
○ Chronological order of events is important
● The data pipeline is very similar, but it takes into account only events
that happened in the pre-defined order of the funnel
● That way we better represent the efficiency of a specific tactic
(i.e. advertisement)
Funnel analysis pipeline - Data Lake
{event_time=2021-05-26T09:15, userid=uid1, attribute=productX_page}
{event_time=2021-05-26T10:10, userid=uid1, attribute=online_ad}
{event_time=2021-05-26T10:11, userid=uid1, attribute=homepage}
....
date=2021-05-24
date=2021-05-25
date=2021-05-26
Funnel analysis pipeline - Mart Generator
campaign=1210
campaign=1319
campaign=1472
date=2021-05-24
date=2021-05-25
date=2021-05-26
{event_time=2021-05-26T09:15, userid=uid1, attribute=productX_page, type=Stage}
{event_time=2021-05-26T10:10, userid=uid1, attribute=online_ad, type=Tactic}
{event_time=2021-05-26T10:11, userid=uid1, attribute=homepage, type=Stage}
....
Funnel analysis pipeline - Enricher
campaign=1210
campaign=1319
campaign=1472
date=2021-05-24
date=2021-05-25
date=2021-05-26
{event_date=2021-05-26, userid=uid1, tactic=online_ad, stage=productX_page}
{event_date=2021-05-26, userid=uid1, tactic=online_ad, stage=homepage}
....
Only events that happened in the pre-defined order of the funnel are kept; the productX_page view (09:15) preceded the online_ad exposure (10:10), so it is dropped:
{event_date=2021-05-26, userid=uid1, tactic=online_ad, stage=homepage}
....
....
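A minimal plain-Python sketch of this sequential filtering, using the three events from the Mart Generator slide. The `enrich_sequential` helper is hypothetical, and it makes two simplifying assumptions: a stage counts only if it happened after the user's earliest exposure to the tactic, and the relative order between stages is not checked.

```python
from datetime import datetime


def enrich_sequential(mart_events):
    """Keep only stage events that happened after the user's first tactic exposure."""
    first_exposure = {}  # (userid, tactic) -> earliest exposure time
    for event in mart_events:
        if event["type"] == "Tactic":
            key = (event["userid"], event["attribute"])
            seen_at = datetime.fromisoformat(event["event_time"])
            if key not in first_exposure or seen_at < first_exposure[key]:
                first_exposure[key] = seen_at

    rows = []
    for event in mart_events:
        if event["type"] != "Stage":
            continue
        viewed_at = datetime.fromisoformat(event["event_time"])
        for (userid, tactic), exposed_at in first_exposure.items():
            if userid == event["userid"] and viewed_at > exposed_at:
                rows.append({"event_date": viewed_at.date().isoformat(),
                             "userid": userid, "tactic": tactic,
                             "stage": event["attribute"]})
    return rows


# Events from the Mart Generator slide.
mart_events = [
    {"event_time": "2021-05-26T09:15", "userid": "uid1", "attribute": "productX_page", "type": "Stage"},
    {"event_time": "2021-05-26T10:10", "userid": "uid1", "attribute": "online_ad", "type": "Tactic"},
    {"event_time": "2021-05-26T10:11", "userid": "uid1", "attribute": "homepage", "type": "Stage"},
]
sequential_rows = enrich_sequential(mart_events)
for row in sequential_rows:
    print(row)
```

The product X page view is discarded because the user had not yet seen the ad, so only the homepage row reaches Druid.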
Funnel analysis pipeline - querying Druid (SQL)
SELECT
  APPROX_COUNT_DISTINCT_DS_THETA(
    THETA_SKETCH_NOT(65536,
      THETA_SKETCH_INTERSECT(65536, a, b),
      THETA_SKETCH_UNION(65536, c, d, e))) AS dropoff_sketch
FROM (
  SELECT
    DS_THETA("user_id_sketch") FILTER (WHERE tactic = 'online_ad') AS a,
    DS_THETA("user_id_sketch") FILTER (WHERE stage = 'homepage') AS b,
    DS_THETA("user_id_sketch") FILTER (WHERE stage = 'productX_page') AS c,
    DS_THETA("user_id_sketch") FILTER (WHERE stage = 'add_to_cart') AS d,
    DS_THETA("user_id_sketch") FILTER (WHERE stage = 'checkout') AS e
  FROM campaign_1472
  WHERE stage IN ('homepage', 'productX_page', 'add_to_cart', 'checkout')
    AND tactic = 'online_ad'
    AND __time BETWEEN '2021-05-01T00:00:00.000' AND '2021-05-26T23:59:59.000'
) subquery
* This specific query should return the estimated number of unique
users for the drop-off between the homepage and product X page
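The set algebra in this query, (a ∩ b) minus (c ∪ d ∪ e), can be checked on a toy example with exact Python sets standing in for the sketches. The user IDs and the `drop_off` helper are made up for illustration.

```python
def drop_off(tactic_users, stage_users, later_stage_users):
    """Users who saw the tactic and reached the stage, but none of the later stages."""
    reached_stage = tactic_users & stage_users    # THETA_SKETCH_INTERSECT(a, b)
    went_further = set().union(*later_stage_users)  # THETA_SKETCH_UNION(c, d, e)
    return reached_stage - went_further           # THETA_SKETCH_NOT(...)


# Toy data: each set plays the role of one sketch (a..e) in the query.
a = {"u1", "u2", "u3", "u4", "u5"}  # tactic = online_ad
b = {"u1", "u2", "u3", "u9"}        # stage = homepage
c = {"u1"}                          # stage = productX_page
d = {"u1"}                          # stage = add_to_cart
e = set()                           # stage = checkout

homepage_drop_off = drop_off(a, b, [c, d, e])
print(sorted(homepage_drop_off))
```

Here `u9` is excluded because that user never saw the ad, and `u1` is excluded because that user continued to later stages.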
Funnel analysis - complex use-case
[Screenshots: date range 5/1/2021 - 5/26/2021]
3,100 - 2,500 = 600
ONLINE AD: 8.1M UUs
HOMEPAGE: 3.1K UUs
  2.5K drop-off
PRODUCT PAGE: 0.6K UUs
...
* UUs = Unique Users
A few tips
● Use Druid with Theta Sketch for fast approximate count distinct
○ Allows set operations (intersection/union/negation)
● Use Spark to pre-process incoming events
○ Allows you to take into account only events that happened in the
pre-defined order of the funnel
○ Check out Etti’s “Optimizing Spark-based data pipelines” talk
(video - tinyurl.com/7hvyxtc8, slides - tinyurl.com/3rvc9mus)
● Optimize your ingestion process
○ Write Theta Sketch objects from Spark app
○ Load to Druid using isInputThetaSketch=true flag
What have we learned?
● Funnel analysis
○ Very important for advertisers
○ Not easy to solve technically (especially if chronological order of events matters)
● Druid is a very powerful tool for real-time analytics
○ Highly scalable, can ingest and store trillions of events, and serve analytic queries with
sub-second latency
○ Used for many different use-cases
● Combining Apache Spark, Druid and DataSketches FTW!
○ Pre-process events before ingesting into Druid
○ Decide how to handle out-of-order events
Want to know more?
● Women in Big Data
○ A world-wide program that aims to inspire, connect, grow, and champion the success of women in the Big Data & analytics field
○ 30+ chapters and 17,000+ members world-wide
○ Everyone can join (regardless of gender), so find a chapter near you -
www.womeninbigdata.org/membership/
● Conference talks
○ Casting the Spell: Druid in Practice (Berlin Buzzwords, June 17th 2021) - tinyurl.com/559hufnj
○ Migrating Airflow-based Spark Jobs to K8s (Data+AI Summit Europe 2020) - tinyurl.com/cbm42mn8
● Our Tech Blog - medium.com/nmc-techblog
○ Data Retention and Deletion in Apache Druid - tinyurl.com/yymrvrn2
QUESTIONS
THANK YOU
Etti Gur
Itai Yaffe
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.
Márton Kodok
 
Optimizing Spark-based data pipelines - are you up for it?
Optimizing Spark-based data pipelines - are you up for it?Optimizing Spark-based data pipelines - are you up for it?
Optimizing Spark-based data pipelines - are you up for it?
Itai Yaffe
 
IoT digital disruption and new IoT business models
IoT digital disruption and new IoT business modelsIoT digital disruption and new IoT business models
IoT digital disruption and new IoT business models
IoTAnalytics
 
IoT Meets Big Data: The Opportunities and Challenges by Syed Hoda of ParStream
IoT Meets Big Data: The Opportunities and Challenges by Syed Hoda of ParStreamIoT Meets Big Data: The Opportunities and Challenges by Syed Hoda of ParStream
IoT Meets Big Data: The Opportunities and Challenges by Syed Hoda of ParStream
gogo6
 
[DSC Europe 24] Ved Prakash - Supercharging Your Data Strategy: Building a Sc...
[DSC Europe 24] Ved Prakash - Supercharging Your Data Strategy: Building a Sc...[DSC Europe 24] Ved Prakash - Supercharging Your Data Strategy: Building a Sc...
[DSC Europe 24] Ved Prakash - Supercharging Your Data Strategy: Building a Sc...
DataScienceConferenc1
 
How to Guarantee Exact COUNT DISTINCT Queries with Sub-Second Latency on Mass...
How to Guarantee Exact COUNT DISTINCT Queries with Sub-Second Latency on Mass...How to Guarantee Exact COUNT DISTINCT Queries with Sub-Second Latency on Mass...
How to Guarantee Exact COUNT DISTINCT Queries with Sub-Second Latency on Mass...
Tyler Wishnoff
 
How to Guarantee Exact Count Distinct Queries with Sub-Second Latency on Mass...
How to Guarantee Exact Count Distinct Queries with Sub-Second Latency on Mass...How to Guarantee Exact Count Distinct Queries with Sub-Second Latency on Mass...
How to Guarantee Exact Count Distinct Queries with Sub-Second Latency on Mass...
SamanthaBerlant
 
VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...
VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...
VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google Bi...
Márton Kodok
 
INTERFACE, by apidays - The Evolution of Data Movement.pdf
INTERFACE, by apidays - The Evolution of Data Movement.pdfINTERFACE, by apidays - The Evolution of Data Movement.pdf
INTERFACE, by apidays - The Evolution of Data Movement.pdf
apidays
 
SF Big Analytics Meetup - Exact Count Distinct with Apache Kylin
SF Big Analytics Meetup - Exact Count Distinct with Apache KylinSF Big Analytics Meetup - Exact Count Distinct with Apache Kylin
SF Big Analytics Meetup - Exact Count Distinct with Apache Kylin
SamanthaBerlant
 
AT&T Mobile App & IoT Hackathon @ Catalyst
AT&T Mobile App & IoT Hackathon @ Catalyst AT&T Mobile App & IoT Hackathon @ Catalyst
AT&T Mobile App & IoT Hackathon @ Catalyst
Ed Donahue
 
Findability Day 2016 - Big data analytics and machine learning
Findability Day 2016 - Big data analytics and machine learningFindability Day 2016 - Big data analytics and machine learning
Findability Day 2016 - Big data analytics and machine learning
Findwise
 
Voxxed Days Cluj - Powering interactive data analysis with Google BigQuery
Voxxed Days Cluj - Powering interactive data analysis with Google BigQueryVoxxed Days Cluj - Powering interactive data analysis with Google BigQuery
Voxxed Days Cluj - Powering interactive data analysis with Google BigQuery
Márton Kodok
 
Business-Plan 3 million
Business-Plan 3 millionBusiness-Plan 3 million
Business-Plan 3 million
Garth Stevens
 
Adding Velocity to BigBench, Todor Ivanov, Patrick Bedué, Roberto Zicari, Ahm...
Adding Velocity to BigBench, Todor Ivanov, Patrick Bedué, Roberto Zicari, Ahm...Adding Velocity to BigBench, Todor Ivanov, Patrick Bedué, Roberto Zicari, Ahm...
Adding Velocity to BigBench, Todor Ivanov, Patrick Bedué, Roberto Zicari, Ahm...
DataBench
 
Adding Velocity to BigBench
Adding Velocity to BigBenchAdding Velocity to BigBench
Adding Velocity to BigBench
t_ivanov
 
Big Data for Product Managers
Big Data for Product ManagersBig Data for Product Managers
Big Data for Product Managers
Pentaho
 
Splunk Artificial Intelligence & Machine Learning Webinar
Splunk Artificial Intelligence & Machine Learning WebinarSplunk Artificial Intelligence & Machine Learning Webinar
Splunk Artificial Intelligence & Machine Learning Webinar
Splunk
 
Tools, Tips and Techniques for Developing Real-time Apps. Phil Leggetter
Tools, Tips and Techniques for Developing Real-time Apps. Phil LeggetterTools, Tips and Techniques for Developing Real-time Apps. Phil Leggetter
Tools, Tips and Techniques for Developing Real-time Apps. Phil Leggetter
Future Insights
 
Funnel Analysis with Apache Spark and Druid

  • 5. @ItaiYaffe, @ettigur Digital advertising - a multi-billion dollar industry ● In 2019, internet advertising spending worldwide was over $290B (statista.com, June 2020) ● Apple spent over $110M on iPhone & TV+ advertising during September and October 2019 (9To5Mac.com, November 2019) ● GM is shifting ‘SIGNIFICANT’ dollars to connected TV advertising (AdAge.com, January 2020) $$$ spent each year on digital advertising campaigns
  • 7. @ItaiYaffe, @ettigur What does a funnel look like? AD EXPOSURE 100M → (85M drop-off) → HOMEPAGE 15M → (5M drop-off) → PRODUCT PAGE 10M. So everybody wants to measure their campaigns’ efficiency! But how???
  • 8. Funnel Analysis with Apache Spark and Druid Etti Gur, Nielsen Itai Yaffe, Imply
  • 9. @ItaiYaffe, @ettigur Introduction Etti Gur ● Senior Big Data Engineer @ Nielsen ● Building data pipelines using Spark, Kafka, Druid, Airflow and more Etti Gur @ettigur Itai Yaffe ● Principal Solutions Architect @ Imply Prev. Big Data Tech Lead @ Nielsen ● Dealing with Big Data challenges since 2012 ● Itai Yaffe @ItaiYaffe
  • 10. @ItaiYaffe, @ettigur Nielsen Identity ● Data and Measurement company ● Media consumption ● Single source of truth of individuals and households ○ Unifies many proprietary datasets ○ Generates holistic view of a consumer
  • 11. @ItaiYaffe, @ettigur Nielsen Identity in numbers >10B events/day 60TB/day S3 6000 nodes/day 10’s of TB ingested/day druid
  • 15. @ItaiYaffe, @ettigur What you will learn? How to overcome the technical challenges of Funnel Analysis using Apache Spark, Druid and DataSketches, and why you should even care
  • 17. @ItaiYaffe, @ettigur Campaign phases - user’s point-of-view ● Awareness: exposed to campaign (e.g. via online ad) ● Consideration: interest is expressed (e.g. clicked ad) ● Intent: steps taken towards making a purchase (e.g. added product to cart) ● Purchase (figure labels: Tactic, Stages)
  • 18. @ItaiYaffe, @ettigur Campaign phases - campaign owner’s point-of-view Awareness → Consideration → Intent → Purchase, with a drop-off between consecutive phases
  • 20. @ItaiYaffe, @ettigur Campaign phases - why is it called “a funnel”? AD EXPOSURE 100M UUs → (85M drop-off) → HOMEPAGE 15M UUs → (5M drop-off) → PRODUCT PAGE 10M UUs → (7M drop-off) → CHECKOUT 3M UUs. * UUs = Unique Users. We need to analyze the funnel, hence: “Funnel Analysis”
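The drop-off figures on the funnel slide follow directly from the unique-user counts of consecutive stages. A minimal sketch of that arithmetic, using the slide's numbers (in millions); the names `FUNNEL` and `funnel_report` are illustrative and not taken from the talk:

```python
# Unique users (UUs) per funnel stage, in millions, as shown on the slide.
FUNNEL = [
    ("AD EXPOSURE", 100),
    ("HOMEPAGE", 15),
    ("PRODUCT PAGE", 10),
    ("CHECKOUT", 3),
]


def funnel_report(stages):
    """Return (from_stage, to_stage, drop_off, conversion_rate) per transition."""
    report = []
    for (prev_name, prev_uus), (name, uus) in zip(stages, stages[1:]):
        drop_off = prev_uus - uus      # users lost between the two stages
        conversion = uus / prev_uus    # share of users who continued
        report.append((prev_name, name, drop_off, conversion))
    return report


if __name__ == "__main__":
    for prev_name, name, drop_off, conversion in funnel_report(FUNNEL):
        print(f"{prev_name} -> {name}: {drop_off}M drop-off, {conversion:.1%} continue")
```

The hard part in practice is not this subtraction but obtaining the unique-user count of each stage at scale, which is what the rest of the deck addresses.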
  • 21. @ItaiYaffe, @ettigur Views vs Unique Users 2 Unique Users 7 Views 2 Purchases $$$ $$$
  • 22. @ItaiYaffe, @ettigur Everybody wants to measure their campaigns’ efficiency! What does a funnel look like? AD EXPOSURE 100M → (85M drop-off) → HOMEPAGE 15M → (5M drop-off) → PRODUCT PAGE 10M
  • 23. @ItaiYaffe, @ettigur But how can one measure campaign efficiency? ● Collect a huge stream of events (i.e. user activities) while the campaign is live ● Map events to funnel stages ○ E.g. ad exposure = tactic ● Provide insights quickly
  • 24. @ItaiYaffe, @ettigur So… what’s wrong with off-the-shelf alternatives?
    Topic | Off-the-shelf alternatives
    Scalability | Limited
    Access to raw data | Lack access
    Count-distinct operations | Very slow
    * Based on tinyurl.com/qqza5ur
  • 26. @ItaiYaffe, @ettigur Why is it cool? ● Store trillions of events, petabytes of data ● Sub-second analytic queries ● Highly scalable ● Cost effective ● Decoupled architecture ○ E.g. ingestion is separated from query
  • 27. @ItaiYaffe, @ettigur Roll-up - Simple Count (Views)
    Raw events:
    Timestamp | Website | Device ID
    2021-05-26 | www.a.com | 3a4c1f2d84a5c179435c1fea86e6ae02
    2021-05-26 | www.a.com | 3a4c1f2d84a5c179435c1fea86e6ae02
    2021-05-26 | www.a.com | 5dd59f9bd068f802a7c6dd832bf60d02
    2021-05-26 | www.b.com | 5dd59f9bd068f802a7c6dd832bf60d02
    2021-05-26 | www.c.com | 5dd59f9bd068f802a7c6dd832bf60d02
    Rolled up with LongSumAggregator:
    Timestamp | Website | Views
    2021-05-26 | www.a.com | 3
    2021-05-26 | www.b.com | 1
    2021-05-26 | www.c.com | 1
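The roll-up on this slide collapses raw events that share the same timestamp and website into a single row carrying a view count. The sketch below reproduces that idea in plain Python on the slide's five rows; it is an illustration of the concept only, not Druid's ingestion code, and `roll_up_views` is a hypothetical name:

```python
from collections import Counter

# The five raw events from the slide: (timestamp, website, device_id).
ROWS = [
    ("2021-05-26", "www.a.com", "3a4c1f2d84a5c179435c1fea86e6ae02"),
    ("2021-05-26", "www.a.com", "3a4c1f2d84a5c179435c1fea86e6ae02"),
    ("2021-05-26", "www.a.com", "5dd59f9bd068f802a7c6dd832bf60d02"),
    ("2021-05-26", "www.b.com", "5dd59f9bd068f802a7c6dd832bf60d02"),
    ("2021-05-26", "www.c.com", "5dd59f9bd068f802a7c6dd832bf60d02"),
]


def roll_up_views(rows):
    """Group rows by (timestamp, website) and count the events in each group."""
    views = Counter()
    for timestamp, website, _device_id in rows:
        views[(timestamp, website)] += 1
    return dict(views)


if __name__ == "__main__":
    for (timestamp, website), count in roll_up_views(ROWS).items():
        print(timestamp, website, count)
```

Note that the device ID is dropped by the roll-up, so a plain count cannot answer how many distinct users were behind those views.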
  • 30. @ItaiYaffe, @ettigur Common use-cases for Druid ● Clickstream analytics ○ Funnel analysis ● Network performance monitoring ● Application performance management ● Supply chain analytics ○ Manufacturing (IoT and device) metrics ● BI and OLAP ● And more...
  • 31. @ItaiYaffe, @ettigur Druid in a nutshell ● A real-time analytics database ○ Time-series, columnar ● Can ingest and store trillions of events, and serve analytic queries with sub-second latency ● Highly-scalable, cost-effective ● Widely used among Big Data companies for: ○ Application performance management ○ Clickstream analytics and funnel analysis ○ And more
  • 34. @ItaiYaffe, @ettigur Why is Druid suitable for the task?
    Topic | Off-the-shelf alternatives | Druid
    Scalability | Limited | Highly scalable
    Access to raw data | Lack access | Can store trillions of events
    Count-distinct operations | Very slow | Sub-second approximate count distinct with set operations using the Theta Sketch module
    Theta Sketch???
    * Based on tinyurl.com/qqza5ur
  • 35. @ItaiYaffe, @ettigur What is Theta Sketch? ● ThetaSketch mathematical framework - generalization of KMV ● K Minimum Values (KMV) ● Estimate set cardinality ● Supports set-theoretic operations
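To make the K Minimum Values idea concrete: hash every item to a pseudo-uniform value in [0, 1), keep only the K smallest distinct hash values, and estimate the cardinality from the K-th smallest one. The sketch below is a textbook KMV estimator written for illustration, not the Apache DataSketches implementation; the function names and the choice of SHA-256 are assumptions.

```python
import hashlib
import heapq


def _hash_to_unit_interval(item: str) -> float:
    """Map an item to a pseudo-uniform value in [0, 1) using SHA-256."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    # Keep 53 bits so the value is exactly representable as a float below 1.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def kmv_estimate(items, k=1024):
    """Estimate the number of distinct items from the k smallest hash values."""
    heap = []        # max-heap via negated values; holds the k smallest hashes
    members = set()  # hashes currently in the heap, used to skip duplicates
    for item in items:
        h = _hash_to_unit_interval(item)
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # New hash is smaller than the largest one kept: swap it in.
            evicted = -heapq.heapreplace(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return float(len(heap))  # fewer than k distinct items: count is exact
    kth_min = -heap[0]           # the k-th smallest hash value seen
    return (k - 1) / kth_min


if __name__ == "__main__":
    stream = (f"user-{i % 100_000}" for i in range(300_000))
    print(round(kmv_estimate(stream)))  # close to 100,000, not exact
```

Memory stays bounded by K regardless of how many events are seen, which is why larger K buys accuracy at the price of memory and storage.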
  • 36. @ItaiYaffe Theta Sketch error - Error as function of K
    Number of Std Dev | 1 | 2
    Confidence Interval | 68.27% | 95.45%
    K = 16,384 | 0.78% | 1.56%
    K = 32,768 | 0.55% | 1.10%
    K = 65,536 | 0.39% | 0.78%
    * Larger K = more memory & storage needed
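The values in the error table are consistent with a relative standard error of roughly 1/sqrt(K), scaled by the number of standard deviations. This relationship is inferred here from the table's numbers rather than stated on the slide, and `relative_error` is an illustrative helper:

```python
import math


def relative_error(k: int, num_std_dev: int = 1) -> float:
    """Approximate relative error of a sketch of size k (assumed ~ 1/sqrt(k))."""
    return num_std_dev / math.sqrt(k)


for k in (16_384, 32_768, 65_536):
    one_sigma = f"{relative_error(k, 1):.2%}"
    two_sigma = f"{relative_error(k, 2):.2%}"
    print(f"K={k:,}: {one_sigma} (1 std dev), {two_sigma} (2 std dev)")
```

Under this approximation, halving the error requires quadrupling K, which is the memory and storage trade-off the slide's footnote points at.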
  • 37. @ItaiYaffe, @ettigur Theta Sketch demo tinyurl.com/ugk6p67
  • 38. @ItaiYaffe, @ettigur The Theta Sketch module in Druid ● Part of the Apache DataSketches library (datasketches.apache.org) ● At ingestion time ○ Sketches are created and stored in Druid segments ● At query time ○ Sketches are aggregated (i.e., union, intersection or difference between sketches) ○ The result is the estimated number of unique entries in the aggregated sketch ● Also see this short video - tinyurl.com/vdwojh6
  • 39. @ItaiYaffe, @ettigur Roll-up - Count Distinct (Unique Users)
      Raw events:
      Timestamp | Website | Device ID
      2021-05-26 | www.a.com | 3a4c1f2d84a5c179435c1fea86e6ae02
      2021-05-26 | www.a.com | 3a4c1f2d84a5c179435c1fea86e6ae02
      2021-05-26 | www.a.com | 5dd59f9bd068f802a7c6dd832bf60d02
      2021-05-26 | www.b.com | 5dd59f9bd068f802a7c6dd832bf60d02
      2021-05-26 | www.c.com | 5dd59f9bd068f802a7c6dd832bf60d02
      After roll-up (ThetaSketchAggregator):
      Timestamp | Website | Unique Users*
      2021-05-26 | www.a.com | 2*
      2021-05-26 | www.b.com | 1*
      2021-05-26 | www.c.com | 1*
      * What is actually stored is a ThetaSketch object. The actual result is calculated in real-time, which allows us to do UNIONs and INTERSECTIONs
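The roll-up on this slide can be sketched in plain Python, using exact sets as a stand-in for ThetaSketch objects. This is only an analogy (Druid stores approximate sketches, not sets), and `roll_up` is a hypothetical helper name.

```python
from collections import defaultdict

# Raw events from the slide: (timestamp, website, device_id).
events = [
    ("2021-05-26", "www.a.com", "3a4c1f2d84a5c179435c1fea86e6ae02"),
    ("2021-05-26", "www.a.com", "3a4c1f2d84a5c179435c1fea86e6ae02"),
    ("2021-05-26", "www.a.com", "5dd59f9bd068f802a7c6dd832bf60d02"),
    ("2021-05-26", "www.b.com", "5dd59f9bd068f802a7c6dd832bf60d02"),
    ("2021-05-26", "www.c.com", "5dd59f9bd068f802a7c6dd832bf60d02"),
]


def roll_up(rows):
    """Group by (timestamp, website) and collect the distinct device IDs."""
    rolled = defaultdict(set)
    for timestamp, website, device_id in rows:
        rolled[(timestamp, website)].add(device_id)
    return dict(rolled)


rolled = roll_up(events)

# Unique users per rolled-up row.
unique_users = {key: len(ids) for key, ids in rolled.items()}

# Because each row holds a set-like object rather than a plain number,
# rows can still be combined at query time without double counting.
all_sites = set().union(*rolled.values())
```

Here `unique_users` holds the per-website counts from the slide, while `all_sites` contains two device IDs rather than the four that summing the per-row counts would suggest.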
  • 40. @ItaiYaffe, @ettigur Cool, so… Back to funnel analysis?
  • 41. @ItaiYaffe, @ettigur Funnel analysis - simple use-case How many unique users viewed online ad? VS How many unique users viewed online ad AND viewed product X page?
  • 42. @ItaiYaffe, @ettigur Funnel analysis - simple use-case 5/1/2021 - 5/26/2021
  • 44. @ItaiYaffe, @ettigur Funnel analysis pipeline - high-level architecture
      Data Lake → 1. Read files of last day → Mart Generator → 2. Write files by campaign,date → Campaigns' marts → 3. Read files per campaign → Enricher → 4. Write files by date,campaign → Enriched data → 5. Load data by campaign (into Druid)
  • 45. @ItaiYaffe, @ettigur Funnel analysis pipeline - Data Lake
      Partitions: date=2021-05-24, date=2021-05-25, date=2021-05-26
      {event_time=2021-05-26T..., userid=uid1, attribute=online_ad}
      {event_time=2021-05-26T..., userid=uid1, attribute=homepage}
      {event_time=2021-05-26T..., userid=uid1, attribute=productX_page}
      ....
  • 46. @ItaiYaffe, @ettigur Funnel analysis pipeline - Mart Generator
      Partitions: campaign=1210, campaign=1319, campaign=1472, each with date=2021-05-24, date=2021-05-25, date=2021-05-26
      {event_time=2021-05-26T..., userid=uid1, attribute=online_ad, type=Tactic}
      {event_time=2021-05-26T..., userid=uid1, attribute=homepage, type=Stage}
      {event_time=2021-05-26T..., userid=uid1, attribute=productX_page, type=Stage}
      ....
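A rough Python sketch of what the Mart Generator step might do: tag each event as a Tactic or a Stage and group it under a campaign/date partition. How events are actually matched to campaigns is not shown in the slides, so the `CAMPAIGNS` mapping, the `build_marts` function and the concrete event times are all hypothetical.

```python
# Hypothetical campaign definition: which attributes belong to the
# campaign, and whether each one is a Tactic or a Stage.
CAMPAIGNS = {
    1472: {
        "online_ad": "Tactic",
        "homepage": "Stage",
        "productX_page": "Stage",
    },
}

# Example events; the times are made up for illustration.
events = [
    {"event_time": "2021-05-26T10:10", "userid": "uid1", "attribute": "online_ad"},
    {"event_time": "2021-05-26T10:11", "userid": "uid1", "attribute": "homepage"},
    {"event_time": "2021-05-26T10:12", "userid": "uid1", "attribute": "productX_page"},
    {"event_time": "2021-05-26T10:13", "userid": "uid2", "attribute": "unrelated_page"},
]


def build_marts(events, campaigns):
    """Return {partition_path: [tagged events]} for every campaign."""
    marts = {}
    for campaign_id, attribute_types in campaigns.items():
        for event in events:
            event_type = attribute_types.get(event["attribute"])
            if event_type is None:
                continue  # The event is not relevant to this campaign.
            date = event["event_time"][:10]
            partition = f"campaign={campaign_id}/date={date}"
            marts.setdefault(partition, []).append({**event, "type": event_type})
    return marts


marts = build_marts(events, CAMPAIGNS)
```

In the real pipeline this step runs in Spark and writes files per partition; the dictionary above only mimics the campaign,date layout.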
  • 47. @ItaiYaffe, @ettigur Funnel analysis pipeline - Enricher
      Partitions: campaign=1210, campaign=1319, campaign=1472, each with date=2021-05-24, date=2021-05-25, date=2021-05-26
      {event_date=2021-05-26, userid=uid1, tactic=online_ad, stage=homepage}
      {event_date=2021-05-26, userid=uid1, tactic=online_ad, stage=productX_page}
      ....
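One way to read the Enricher output above is that every tactic a user saw is paired with every stage that user reached. The Python sketch below implements that reading for the simple (non-sequential) funnel. The function name `enrich`, the concrete event times, and taking `event_date` from the stage event are assumptions made for illustration.

```python
from itertools import product

# Tagged events as produced by the Mart Generator; times are made up.
events = [
    {"event_time": "2021-05-26T10:10", "userid": "uid1",
     "attribute": "online_ad", "type": "Tactic"},
    {"event_time": "2021-05-26T10:11", "userid": "uid1",
     "attribute": "homepage", "type": "Stage"},
    {"event_time": "2021-05-26T10:12", "userid": "uid1",
     "attribute": "productX_page", "type": "Stage"},
]


def enrich(events):
    """Pair every tactic a user saw with every stage that user reached."""
    by_user = {}
    for event in events:
        by_user.setdefault(event["userid"], []).append(event)

    rows = []
    for userid, user_events in by_user.items():
        tactics = [e for e in user_events if e["type"] == "Tactic"]
        stages = [e for e in user_events if e["type"] == "Stage"]
        for tactic, stage in product(tactics, stages):
            rows.append({
                "event_date": stage["event_time"][:10],  # assumption
                "userid": userid,
                "tactic": tactic["attribute"],
                "stage": stage["attribute"],
            })
    return rows


enriched = enrich(events)
```

Note that this version ignores the order of events, which is exactly the limitation the sequential funnel addresses later in the deck.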
  • 48. @ItaiYaffe, @ettigur Funnel analysis pipeline - ingesting data into Druid
      "type": "index_hadoop", "spec": { "dataSchema": { "dataSource": "campaign_1472", "granularitySpec": { "queryGranularity": "day", "segmentGranularity": "day", "type": "uniform", "intervals": ["2021-05-01/2021-05-27"] ...
  • 49. @ItaiYaffe, @ettigur Funnel analysis pipeline - ingesting data into Druid
      "timestampSpec": { "column": "event_date", "format": "yyyy-MM-dd" }, "dimensionsSpec": { "dimensions": ["tactic", "stage"] }, "metricsSpec": [{ "fieldName": "userid", "type": "thetaSketch", "name": "user_id_sketch", "size": 65536 }], ...
  • 50. @ItaiYaffe, @ettigur Funnel analysis pipeline - ingesting data into Druid
      "inputSpec": { "type": "multi", "children": [ { "type": "dataSource", "ingestionSpec": { "intervals": ["2021-05-01/2021-05-27"], "dataSource": "campaign_1472", ... } }, { "type": "static", "paths": "s3://<BUCKET_NAME>/date=2021-05-26/campaign=1472", ... }, ...
  • 51. @ItaiYaffe, @ettigur Funnel analysis pipeline - Druid datasources
      Datasources: campaign_1210, campaign_1319, campaign_1472
      {__time=2021-05-26, tactic=online_ad, stage=homepage, user_id_sketch=<Object>}
      {__time=2021-05-26, tactic=online_ad, stage=productX_page, user_id_sketch=<Object>}
      ....
  • 52. @ItaiYaffe, @ettigur Funnel analysis pipeline - querying Druid (SQL)
      SELECT APPROX_COUNT_DISTINCT_DS_THETA(user_id_sketch, 65536) AS homepage_sketch
      FROM campaign_1472
      WHERE (("tactic" = 'online_ad') AND ("stage" = 'homepage'))
        AND __time BETWEEN '2021-05-01T00:00:00.000' AND '2021-05-26T23:59:59.000'
      * This specific query returns the estimated number of unique users that viewed the online ad AND viewed the homepage
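A query like the one above can be submitted to Druid's SQL HTTP endpoint (`/druid/v2/sql`) as a JSON body with a `query` field. The Python sketch below only builds the request and does not send it. The host and port in `DRUID_SQL_URL` are hypothetical, and `build_sql_request` is an illustrative helper, not part of any Druid client.

```python
import json
import urllib.request

# Assumption: a Druid Router or Broker is reachable at this address.
DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"

QUERY = """
SELECT APPROX_COUNT_DISTINCT_DS_THETA(user_id_sketch, 65536) AS homepage_sketch
FROM campaign_1472
WHERE tactic = 'online_ad' AND stage = 'homepage'
  AND __time BETWEEN '2021-05-01T00:00:00.000' AND '2021-05-26T23:59:59.000'
"""


def build_sql_request(query: str, url: str = DRUID_SQL_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the SQL query as JSON."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_sql_request(QUERY)
```

Against a running cluster, `urllib.request.urlopen(request)` would send the query and return the result rows as JSON.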
  • 53. @ItaiYaffe, @ettigur Funnel analysis - simple use-case revisited 5/1/2021 - 5/26/2021
  • 56. @ItaiYaffe, @ettigur Funnel analysis - simple use-case revisited: 3,100 - 2,500 != 1,000 (homepage UUs minus the drop-off does not match the product page UUs)
  • 57. @ItaiYaffe, @ettigur Funnel analysis - simple use-case revisited
      ONLINE AD: 8.1M UUs
      HOMEPAGE: 3.1K UUs (2.5K drop-off)
      PRODUCT PAGE: 1K UUs
      ...
      * UUs = Unique Users
  • 59. @ItaiYaffe, @ettigur Funnel analysis - complex use-case How many unique users viewed online ad? VS How many unique users viewed online ad FIRST and THEN viewed product X page?
  • 60. @ItaiYaffe, @ettigur Funnel analysis - complex use-case ● This is what we call a sequential funnel ○ Chronological order of events is important ● The data pipeline is very similar, but… ○ It takes into account only events that happened in the pre-defined order of the funnel ● That way we better represent the efficiency of a specific tactic (i.e., advertisement)
  • 61. @ItaiYaffe, @ettigur Funnel analysis pipeline - reminder
      Data Lake → 1. Read files of last day → Mart Generator → 2. Write files by campaign,date → Campaigns' marts → 3. Read files per campaign → Enricher → 4. Write files by date,campaign → Enriched data → 5. Load data by campaign (into Druid)
  • 62. @ItaiYaffe, @ettigur Funnel analysis pipeline - Data Lake
      Partitions: date=2021-05-24, date=2021-05-25, date=2021-05-26
      {event_time=2021-05-26T09:15, userid=uid1, attribute=productX_page}
      {event_time=2021-05-26T10:10, userid=uid1, attribute=online_ad}
      {event_time=2021-05-26T10:11, userid=uid1, attribute=homepage}
      ....
  • 63. @ItaiYaffe, @ettigur Funnel analysis pipeline - Mart Generator
      Partitions: campaign=1210, campaign=1319, campaign=1472, each with date=2021-05-24, date=2021-05-25, date=2021-05-26
      {event_time=2021-05-26T09:15, userid=uid1, attribute=productX_page, type=Stage}
      {event_time=2021-05-26T10:10, userid=uid1, attribute=online_ad, type=Tactic}
      {event_time=2021-05-26T10:11, userid=uid1, attribute=homepage, type=Stage}
      ....
  • 64. @ItaiYaffe, @ettigur Funnel analysis pipeline - Enricher
      Partitions: campaign=1210, campaign=1319, campaign=1472, each with date=2021-05-24, date=2021-05-25, date=2021-05-26
      {event_date=2021-05-26, userid=uid1, tactic=online_ad, stage=productX_page}
      {event_date=2021-05-26, userid=uid1, tactic=online_ad, stage=homepage}
      ....
  • 66. @ItaiYaffe, @ettigur Funnel analysis pipeline - Enricher
      Partitions: campaign=1210, campaign=1319, campaign=1472, each with date=2021-05-24, date=2021-05-25, date=2021-05-26
      {event_date=2021-05-26, userid=uid1, tactic=online_ad, stage=homepage}
      ....
      (The productX_page row is dropped because uid1 viewed that page at 09:15, before the online ad at 10:10.)
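The sequential variant of the Enricher can be sketched in Python as a filter that keeps a stage only if it happened after the tactic. This is a simplified reading of "the pre-defined order of the funnel": it checks tactic-before-stage only and does not enforce an order between stages. `enrich_sequential` is a hypothetical name, and taking `event_date` from the stage event is an assumption.

```python
# Tagged events from the Mart Generator slide of the sequential example.
events = [
    {"event_time": "2021-05-26T09:15", "userid": "uid1",
     "attribute": "productX_page", "type": "Stage"},
    {"event_time": "2021-05-26T10:10", "userid": "uid1",
     "attribute": "online_ad", "type": "Tactic"},
    {"event_time": "2021-05-26T10:11", "userid": "uid1",
     "attribute": "homepage", "type": "Stage"},
]


def enrich_sequential(events):
    """Pair each tactic only with stages the same user reached afterwards."""
    by_user = {}
    for event in events:
        by_user.setdefault(event["userid"], []).append(event)

    rows = []
    for userid, user_events in by_user.items():
        tactics = [e for e in user_events if e["type"] == "Tactic"]
        stages = [e for e in user_events if e["type"] == "Stage"]
        for tactic in tactics:
            for stage in stages:
                # Same-format ISO-8601 strings compare chronologically.
                if stage["event_time"] > tactic["event_time"]:
                    rows.append({
                        "event_date": stage["event_time"][:10],  # assumption
                        "userid": userid,
                        "tactic": tactic["attribute"],
                        "stage": stage["attribute"],
                    })
    return rows


enriched = enrich_sequential(events)
```

With the three events above, only the homepage row survives, which matches the Enricher output shown on the slide.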
  • 67. @ItaiYaffe, @ettigur Funnel analysis pipeline - querying Druid (SQL)
      SELECT APPROX_COUNT_DISTINCT_DS_THETA(
               THETA_SKETCH_NOT(65536,
                 THETA_SKETCH_INTERSECT(65536, a, b),
                 THETA_SKETCH_UNION(65536, c, d, e))) AS dropoff_sketch
      FROM (
        SELECT DS_THETA("user_id_sketch") FILTER (WHERE tactic = 'online_ad') AS a,
               DS_THETA("user_id_sketch") FILTER (WHERE stage = 'homepage') AS b,
               DS_THETA("user_id_sketch") FILTER (WHERE stage = 'productX_page') AS c,
               DS_THETA("user_id_sketch") FILTER (WHERE stage = 'add_to_cart') AS d,
               DS_THETA("user_id_sketch") FILTER (WHERE stage = 'checkout') AS e
        FROM campaign_1472
        WHERE stage IN ('homepage', 'productX_page', 'add_to_cart', 'checkout')
          AND tactic = 'online_ad'
          AND __time BETWEEN '2021-05-01T00:00:00.000' AND '2021-05-26T23:59:59.000'
      ) subquery
      * This specific query should return the estimated number of unique users for the drop-off between the homepage and product X page
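The set algebra behind this query can be illustrated with exact Python sets standing in for the sketches `a` to `e`: users who saw the ad and reached the homepage, minus anyone who reached a later stage. The user IDs below are made up, and real Theta Sketch results are estimates rather than exact counts.

```python
# Hypothetical user IDs per filter, standing in for the sketches a..e.
a = {"u1", "u2", "u3", "u4"}  # tactic = 'online_ad'
b = {"u1", "u2", "u3"}        # stage = 'homepage'
c = {"u1"}                    # stage = 'productX_page'
d = {"u1"}                    # stage = 'add_to_cart'
e = set()                     # stage = 'checkout'

# THETA_SKETCH_INTERSECT(a, b): saw the ad and reached the homepage.
reached_homepage = a & b

# THETA_SKETCH_UNION(c, d, e): reached any stage after the homepage.
went_further = c | d | e

# THETA_SKETCH_NOT(...): reached the homepage but went no further.
dropoff = reached_homepage - went_further

print(len(dropoff))
```

In this toy data the drop-off consists of `u2` and `u3`; `u4` never reached the homepage and `u1` continued down the funnel.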
  • 68. @ItaiYaffe, @ettigur Funnel analysis - complex use-case 5/1/2021 - 5/26/2021
  • 71. @ItaiYaffe, @ettigur Funnel analysis - complex use-case 3,100 - 2,500 = 600
  • 72. @ItaiYaffe, @ettigur Funnel analysis - complex use-case
      ONLINE AD: 8.1M UUs
      HOMEPAGE: 3.1K UUs (2.5K drop-off)
      PRODUCT PAGE: 0.6K UUs
      ...
      * UUs = Unique Users
  • 74. @ItaiYaffe, @ettigur A few tips ● Use Druid with Theta Sketch for fast approximate count distinct ○ Allows set operations (intersection/union/negation) ● Use Spark to pre-process incoming events ○ Allows you to take into account only events that happened in the pre-defined order of the funnel ○ Check out Etti’s “Optimizing Spark-based data pipelines” talk (video - tinyurl.com/7hvyxtc8, slides - tinyurl.com/3rvc9mus) ● Optimize your ingestion process ○ Write Theta Sketch objects from Spark app ○ Load to Druid using isInputThetaSketch=true flag
  • 77. @ItaiYaffe, @ettigur What have we learned? ● Funnel analysis ○ Very important for advertisers ○ Not easy to solve technically (especially if chronological order of events matters) ● Druid is a very powerful tool for real-time analytics ○ Highly scalable, can ingest and store trillions of events, and serve analytic queries in sub-second time ○ Used for many different use-cases ● Combining Apache Spark, Druid and DataSketches FTW! ○ Pre-process events before ingesting into Druid ○ Decide how to handle out-of-order events
  • 78. @ItaiYaffe, @ettigur Want to know more? ● Women in Big Data ○ A world-wide program that aims to inspire, connect, grow, and champion the success of women in the Big Data & analytics field ○ 30+ chapters and 17,000+ members world-wide ○ Everyone can join (regardless of gender), so find a chapter near you - www.womeninbigdata.org/membership/ ● Conference talks ○ Casting the Spell: Druid in Practice (Berlin Buzzwords, June 17th 2021) - tinyurl.com/559hufnj ○ Migrating Airflow-based Spark Jobs to K8s (Data+AI Summit Europe 2020) - tinyurl.com/cbm42mn8 ● Our Tech Blog - medium.com/nmc-techblog ○ Data Retention and Deletion in Apache Druid - tinyurl.com/yymrvrn2
  • 80. THANK YOU Etti Gur, Itai Yaffe
  • 81. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.