Funnel Analysis with Apache Spark and Druid

@ItaiYaffe, @ettigur
Digital advertising - a multi-billion dollar industry
● In 2019, internet advertising spending worldwide was over $290B (statista.com, June 2020)
● Apple spent over $110M on iPhone & TV+ advertising during September and October 2019 (9To5Mac.com, November 2019)
● GM is shifting ‘SIGNIFICANT’ dollars to connected TV advertising (AdAge.com, January 2020)
$$$ spent each year on digital advertising campaigns

What does a funnel look like?
Stage          Users   Drop-off from previous stage
AD EXPOSURE    100M
HOMEPAGE       15M     85M
PRODUCT PAGE   10M     5M
So everybody wants to measure their campaigns’ efficiency!

But how???

Funnel Analysis with
Apache Spark and Druid
Etti Gur, Nielsen
Itai Yaffe, Imply

Introduction
Etti Gur (@ettigur)
● Senior Big Data Engineer @ Nielsen
● Building data pipelines using Spark, Kafka, Druid, Airflow and more
Itai Yaffe (@ItaiYaffe)
● Principal Solutions Architect @ Imply
● Prev. Big Data Tech Lead @ Nielsen
● Dealing with Big Data challenges since 2012

Nielsen Identity
● Data and Measurement company
● Media consumption
● Single source of truth of individuals and households
○ Unifies many proprietary datasets
○ Generates holistic view of a consumer

Nielsen Identity in numbers
● >10B events/day
● 60TB/day (S3)
● 6000 nodes/day
● 10’s of TB ingested/day (Druid)

The challenges
● Scalability
● Cost Efficiency
● Fault-tolerance

What will you learn?
How to overcome the technical challenges of Funnel Analysis
using Apache Spark, Druid and DataSketches,
and why you should even care

Campaign phases - user’s point-of-view
● Awareness: exposed to campaign (e.g. via online ad)
● Consideration: interest is expressed (e.g. clicked ad)
● Intent: steps taken towards making a purchase (e.g. added product to cart)
● Purchase

Campaign phases - tactic vs. stages
● Tactic: Awareness
● Stages: Consideration, Intent, Purchase

Campaign phases - campaign owner’s point-of-view
Awareness -> Consideration -> Intent -> Purchase
(with a drop-off between each two consecutive phases)

Campaign phases - why is it called “a funnel”?
Stage          Unique users   Drop-off from previous stage
AD EXPOSURE    100M UUs
HOMEPAGE       15M UUs        85M
PRODUCT PAGE   10M UUs        5M
CHECKOUT       3M UUs         7M
* UUs = Unique Users

We need to analyze the funnel, hence:
“Funnel Analysis”
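
The drop-off arithmetic of the funnel can be sketched in a few lines of plain Python. The stage counts are the ones from the slide; the function name `drop_offs` and the conversion-rate column are illustrative additions, not part of the original talk.

```python
# Unique users (UUs) per funnel stage, taken from the funnel slide.
funnel = [
    ("AD EXPOSURE", 100_000_000),
    ("HOMEPAGE", 15_000_000),
    ("PRODUCT PAGE", 10_000_000),
    ("CHECKOUT", 3_000_000),
]


def drop_offs(stages):
    """Return (from_stage, to_stage, dropped_users, conversion_rate) per step."""
    steps = []
    for (prev_name, prev_uus), (name, uus) in zip(stages, stages[1:]):
        steps.append((prev_name, name, prev_uus - uus, uus / prev_uus))
    return steps


for prev_name, name, dropped, rate in drop_offs(funnel):
    print(f"{prev_name} -> {name}: {dropped:,} dropped, {rate:.0%} converted")
```

Each stage keeps only a fraction of the previous one, which is what gives the chart its funnel shape.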

Views vs Unique Users
2 Unique Users
7 Views
2 Purchases $$$ $$$


But how can one measure campaign efficiency?
● Collect a huge stream of events (i.e. user activities)
while the campaign is live
● Map events to funnel stages
○ E.g. ad exposure = tactic
● Provide insights quickly

So… what’s wrong with off-the-shelf alternatives?
Topic                       Off-the-shelf alternatives
Scalability                 Limited
Access to raw data          Lack access
Count-distinct operations   Very slow
* Based on tinyurl.com/qqza5ur

Introducing: Apache Druid

Why is it cool?
● Store trillions of events, petabytes of data
● Sub-second analytic queries
● Highly scalable
● Cost effective
● Decoupled architecture
○ E.g. ingestion is separated from query

Roll-up - Simple Count (Views)
Raw events:
Timestamp    Website     Device ID
2021-05-26   www.a.com   3a4c1f2d84a5c179435c1fea86e6ae02
2021-05-26   www.a.com   3a4c1f2d84a5c179435c1fea86e6ae02
2021-05-26   www.a.com   5dd59f9bd068f802a7c6dd832bf60d02
2021-05-26   www.b.com   5dd59f9bd068f802a7c6dd832bf60d02
2021-05-26   www.c.com   5dd59f9bd068f802a7c6dd832bf60d02
After roll-up (LongSumAggregator):
Timestamp    Website     Views
2021-05-26   www.a.com   3
2021-05-26   www.b.com   1
2021-05-26   www.c.com   1
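
The roll-up above can be imitated in plain Python to show what is kept and what is lost. This is only an illustration of the idea (group by timestamp and website, sum a count); it is not how Druid implements roll-up, and `roll_up_views` is a made-up name.

```python
from collections import Counter

# Raw events from the slide: (timestamp, website, device ID).
raw_events = [
    ("2021-05-26", "www.a.com", "3a4c1f2d84a5c179435c1fea86e6ae02"),
    ("2021-05-26", "www.a.com", "3a4c1f2d84a5c179435c1fea86e6ae02"),
    ("2021-05-26", "www.a.com", "5dd59f9bd068f802a7c6dd832bf60d02"),
    ("2021-05-26", "www.b.com", "5dd59f9bd068f802a7c6dd832bf60d02"),
    ("2021-05-26", "www.c.com", "5dd59f9bd068f802a7c6dd832bf60d02"),
]


def roll_up_views(events):
    """Collapse raw events into one row per (timestamp, website) with a view count."""
    views = Counter()
    for timestamp, website, _device_id in events:
        views[(timestamp, website)] += 1  # the device ID is dropped by the roll-up
    return dict(views)


for (timestamp, website), count in roll_up_views(raw_events).items():
    print(timestamp, website, count)
```

Five raw rows become three, but the device IDs are gone, so unique users can no longer be derived from this table.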

Druid architecture

Powered by Druid

Common use-cases for Druid
● Clickstream analytics
○ Funnel analysis
● Network performance monitoring
● Application performance management
● Supply chain analytics
○ Manufacturing (IoT and device) metrics
● BI and OLAP
● And more...

Druid in a nutshell
● A real-time analytics database
○ Time-series, columnar
● Can ingest and store trillions of events, and serve analytic queries in
under a second
● Highly-scalable, cost-effective
● Widely used among Big Data companies for:
○ Application performance management
○ Clickstream analytics and funnel analysis
○ And more

Why is Druid suitable for the task?
Topic                       Off-the-shelf alternatives   Druid
Scalability                 Limited                      Highly scalable
Access to raw data          Lack access                  Can store trillions of events
Count-distinct operations   Very slow                    Sub-second approximate count distinct with set operations, using the Theta Sketch module

Theta Sketch???

What is Theta Sketch?
● ThetaSketch mathematical framework - generalization of KMV
● K Minimum Values (KMV)
● Estimate set cardinality
● Supports set-theoretic operations
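
A minimal sketch of the KMV idea, in plain Python rather than the DataSketches library: hash every item to a number in [0, 1), keep the K smallest distinct hash values, and estimate the cardinality from the K-th smallest one. The helper names `hash01` and `kmv_estimate` are illustrative, and for brevity the code hashes all items up front instead of streaming them.

```python
import hashlib
import heapq


def hash01(item: str) -> float:
    """Map an item to a pseudo-uniform float in [0, 1) using SHA-256."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_estimate(items, k: int = 1024) -> float:
    """Estimate the number of distinct items from the k smallest hash values."""
    distinct_hashes = {hash01(item) for item in items}  # duplicates collapse here
    if len(distinct_hashes) <= k:
        return float(len(distinct_hashes))  # at most k distinct values: exact count
    k_smallest = heapq.nsmallest(k, distinct_hashes)
    kth_min = k_smallest[-1]
    # If k values fall below kth_min, the whole [0, 1) range holds about (k - 1) / kth_min.
    return (k - 1) / kth_min


users = [f"user{i}" for i in range(50_000)]
print(round(kmv_estimate(users)))          # close to 50,000, not exact
print(round(kmv_estimate(users + users)))  # duplicates do not change the estimate
```

Only K hash values are needed per sketch, regardless of how many events were seen, which is what makes the approach attractive at scale.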

Theta Sketch error
Error as function of K
K        1 Std Dev (68.27% confidence)   2 Std Dev (95.45% confidence)
16,384   0.78%                           1.56%
32,768   0.55%                           1.10%
65,536   0.39%                           0.78%
* Larger K = more memory & storage needed
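
The figures in the error table are consistent with a relative error of roughly 1/sqrt(K) per standard deviation. That relationship is inferred here from the numbers and is an approximation, not a statement from the slides; `relative_error` is an illustrative name.

```python
import math


def relative_error(k: int, num_std_dev: int = 1) -> float:
    """Approximate relative error for a sketch of nominal size k (assumed ~ 1/sqrt(k))."""
    return num_std_dev / math.sqrt(k)


for k in (16_384, 32_768, 65_536):
    print(f"{k:>6,}  {relative_error(k, 1):.2%}  {relative_error(k, 2):.2%}")
```

Quadrupling K halves the error, so accuracy gets expensive quickly in memory and storage.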

Theta Sketch demo
tinyurl.com/ugk6p67

The Theta Sketch module in Druid
● Part of the Apache DataSketches library (datasketches.apache.org)
● At ingestion time
○ Sketches are created and stored in Druid segments
● At query time
○ Sketches are aggregated (i.e. union, intersection or difference
between sketches)
○ The result is the estimated number of unique entries in the aggregated
sketch
● Also see this short video - tinyurl.com/vdwojh6
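
To make the query-time set operations concrete, here is a toy model of a theta-style sketch in plain Python. It is a simplified illustration of the idea (a threshold theta plus the hash values below it), not the DataSketches or Druid implementation, and every name in it (`ToySketch`, `build`, `union`, `intersect`, `a_not_b`) is hypothetical.

```python
import hashlib
from dataclasses import dataclass


def hash01(item: str) -> float:
    """Map an item to a pseudo-uniform float in [0, 1) using SHA-256."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


@dataclass(frozen=True)
class ToySketch:
    theta: float       # sampling threshold in (0, 1]
    hashes: frozenset  # retained hash values, all below theta

    def estimate(self) -> float:
        return len(self.hashes) / self.theta


def _trim(hashes, theta, k):
    """Keep at most k hashes below theta, lowering theta if there are more."""
    kept = sorted(h for h in hashes if h < theta)
    if len(kept) > k:
        theta = kept[k]  # the (k+1)-th smallest hash becomes the new threshold
        kept = kept[:k]
    return ToySketch(theta, frozenset(kept))


def build(items, k=4096):
    return _trim({hash01(item) for item in items}, 1.0, k)


def union(a, b, k=4096):
    return _trim(a.hashes | b.hashes, min(a.theta, b.theta), k)


def intersect(a, b):
    theta = min(a.theta, b.theta)
    return ToySketch(theta, frozenset(h for h in a.hashes & b.hashes if h < theta))


def a_not_b(a, b):
    theta = min(a.theta, b.theta)
    return ToySketch(theta, frozenset(h for h in a.hashes - b.hashes if h < theta))


# Hypothetical audiences: 20,000 users saw the ad, 20,000 viewed the page, 10,000 did both.
ad_users = build(f"user{i}" for i in range(20_000))
page_users = build(f"user{i}" for i in range(10_000, 30_000))
print(round(union(ad_users, page_users).estimate()))      # roughly 30,000
print(round(intersect(ad_users, page_users).estimate()))  # roughly 10,000
print(round(a_not_b(ad_users, page_users).estimate()))    # roughly 10,000
```

The results are estimates, and intersections and differences are typically less accurate than unions because fewer retained hashes survive the operation.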

Roll-up - Count Distinct (Unique Users)
Raw events:
Timestamp    Website     Device ID
2021-05-26   www.a.com   3a4c1f2d84a5c179435c1fea86e6ae02
2021-05-26   www.a.com   3a4c1f2d84a5c179435c1fea86e6ae02
2021-05-26   www.a.com   5dd59f9bd068f802a7c6dd832bf60d02
2021-05-26   www.b.com   5dd59f9bd068f802a7c6dd832bf60d02
2021-05-26   www.c.com   5dd59f9bd068f802a7c6dd832bf60d02
After roll-up (ThetaSketchAggregator):
Timestamp    Website     Unique Users*
2021-05-26   www.a.com   2*
2021-05-26   www.b.com   1*
2021-05-26   www.c.com   1*
* What is actually stored is a ThetaSketch object.
The actual result is calculated in real-time, which allows us
to do UNIONs and INTERSECTIONs
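
Why store an object rather than a number? The point can be illustrated with exact Python sets standing in for Theta Sketches (an assumption made for readability; real sketches are approximate and bounded in size). `roll_up_unique_users` is an illustrative name.

```python
from collections import defaultdict

# Raw events from the slide: (timestamp, website, device ID).
raw_events = [
    ("2021-05-26", "www.a.com", "3a4c1f2d84a5c179435c1fea86e6ae02"),
    ("2021-05-26", "www.a.com", "3a4c1f2d84a5c179435c1fea86e6ae02"),
    ("2021-05-26", "www.a.com", "5dd59f9bd068f802a7c6dd832bf60d02"),
    ("2021-05-26", "www.b.com", "5dd59f9bd068f802a7c6dd832bf60d02"),
    ("2021-05-26", "www.c.com", "5dd59f9bd068f802a7c6dd832bf60d02"),
]


def roll_up_unique_users(events):
    """One row per (timestamp, website); the metric is a set of device IDs."""
    rolled = defaultdict(set)
    for timestamp, website, device_id in events:
        rolled[(timestamp, website)].add(device_id)
    return dict(rolled)


rolled = roll_up_unique_users(raw_events)
unique_per_site = {site: len(devices) for (_, site), devices in rolled.items()}
print(unique_per_site)

# Unique users across www.a.com and www.b.com: union the stored objects, then count.
a_or_b = rolled[("2021-05-26", "www.a.com")] | rolled[("2021-05-26", "www.b.com")]
print(len(a_or_b))
```

Adding the per-site numbers (2 + 1) would give 3, while the union of the stored objects gives the correct 2, because the same device visited both sites.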

Cool, so… Back to funnel analysis?

Funnel analysis - simple use-case
How many unique users viewed online ad?
VS
How many unique users viewed
online ad AND viewed product X page?

5/1/2021 - 5/26/2021

Funnel analysis pipeline - high-level architecture
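Data Lake -> Mart Generator -> Campaigns' marts -> Enricher -> Enriched data -> Druid
1. Mart Generator: read files of last day (from the Data Lake)
2. Mart Generator: write files by campaign,date (to the campaigns' marts)
3. Enricher: read files per campaign (from the campaigns' marts)
4. Enricher: write files by date,campaign (to the enriched data)
5. Load data by campaign (into Druid)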

Funnel analysis pipeline - Data Lake
{event_time=2021-05-26T..., userid=uid1, attribute=online_ad}
{event_time=2021-05-26T..., userid=uid1, attribute=homepage}
{event_time=2021-05-26T..., userid=uid1, attribute=productX_page}
....
date=2021-05-24
date=2021-05-25
date=2021-05-26

Funnel analysis pipeline - Mart Generator
campaign=1210
campaign=1319
campaign=1472
date=2021-05-24
date=2021-05-25
date=2021-05-26
{event_time=2021-05-26T..., userid=uid1, attribute=online_ad, type=Tactic}
{event_time=2021-05-26T..., userid=uid1, attribute=homepage, type=Stage}
{event_time=2021-05-26T..., userid=uid1, attribute=productX_page, type=Stage}
....
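
A rough sketch of what the Mart Generator step does, written in plain Python instead of Spark so it stays self-contained. The campaign definition, the timestamps and the names `CAMPAIGN_ATTRIBUTES` and `generate_marts` are all hypothetical; the real job and its configuration are not shown in the slides.

```python
from collections import defaultdict

# Hypothetical campaign definition: which attributes belong to the campaign,
# and whether each one is a tactic or a stage.
CAMPAIGN_ATTRIBUTES = {
    1472: {"online_ad": "Tactic", "homepage": "Stage", "productX_page": "Stage"},
}


def generate_marts(events, campaign_attributes=CAMPAIGN_ATTRIBUTES):
    """Tag relevant events with a type and group them by (campaign, date)."""
    marts = defaultdict(list)
    for event in events:
        date = event["event_time"][:10]  # "YYYY-MM-DD" prefix of the timestamp
        for campaign_id, attributes in campaign_attributes.items():
            event_type = attributes.get(event["attribute"])
            if event_type is not None:  # events unrelated to the campaign are skipped
                marts[(campaign_id, date)].append({**event, "type": event_type})
    return dict(marts)


# Hypothetical data-lake events (the times of day are made up for illustration).
events = [
    {"event_time": "2021-05-26T10:10", "userid": "uid1", "attribute": "online_ad"},
    {"event_time": "2021-05-26T10:11", "userid": "uid1", "attribute": "homepage"},
    {"event_time": "2021-05-26T10:15", "userid": "uid1", "attribute": "productX_page"},
    {"event_time": "2021-05-26T10:20", "userid": "uid2", "attribute": "unrelated_page"},
]
marts = generate_marts(events)
for (campaign_id, date), rows in marts.items():
    print(f"campaign={campaign_id}/date={date}: {len(rows)} events")
```

Grouping by campaign first means the next step can read a single campaign's data without scanning the whole day.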

Funnel analysis pipeline - Enricher
campaign=1210
campaign=1319
campaign=1472
date=2021-05-24
date=2021-05-25
date=2021-05-26
{event_date=2021-05-26, userid=uid1, tactic=online_ad, stage=homepage}
{event_date=2021-05-26, userid=uid1, tactic=online_ad, stage=productX_page}
....
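
One possible reading of the Enricher step for the simple (non-sequential) case, again in plain Python rather than Spark: every stage a user reached is paired with every tactic that user was exposed to, ignoring order. The pairing rule and the name `enrich` are assumptions; the slides only show the input and output records.

```python
from collections import defaultdict


def enrich(mart_events):
    """Pair each stage event with all tactics seen by the same user (order ignored)."""
    tactics_by_user = defaultdict(set)
    for event in mart_events:
        if event["type"] == "Tactic":
            tactics_by_user[event["userid"]].add(event["attribute"])

    enriched = []
    for event in mart_events:
        if event["type"] != "Stage":
            continue
        for tactic in sorted(tactics_by_user.get(event["userid"], ())):
            enriched.append({
                "event_date": event["event_time"][:10],
                "userid": event["userid"],
                "tactic": tactic,
                "stage": event["attribute"],
            })
    return enriched


# Mart records shaped like the ones on the Mart Generator slide (times are made up).
mart_events = [
    {"event_time": "2021-05-26T10:10", "userid": "uid1", "attribute": "online_ad", "type": "Tactic"},
    {"event_time": "2021-05-26T10:11", "userid": "uid1", "attribute": "homepage", "type": "Stage"},
    {"event_time": "2021-05-26T10:15", "userid": "uid1", "attribute": "productX_page", "type": "Stage"},
]
for row in enrich(mart_events):
    print(row)
```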

Funnel analysis pipeline - ingesting data into Druid
"type": "index_hadoop",
"spec": {
  "dataSchema": {
    "dataSource": "campaign_1472",
    "granularitySpec": {
      "queryGranularity": "day",
      "segmentGranularity": "day",
      "type": "uniform",
      "intervals": ["2021-05-01/2021-05-27"]
      ...

"timestampSpec": {
  "column": "event_date", "format": "yyyy-MM-dd"
},
"dimensionsSpec": {
  "dimensions": ["tactic", "stage"]
},
"metricsSpec": [{
  "fieldName": "userid", "type": "thetaSketch",
  "name": "user_id_sketch", "size": 65536}],
...

"inputSpec": {"type": "multi",
  "children": [
    {"type": "dataSource",
     "ingestionSpec": {
       "intervals": ["2021-05-01/2021-05-27"],
       "dataSource": "campaign_1472", ...}},
    {"type": "static",
     "paths": "s3://<BUCKET_NAME>/date=2021-05-26/campaign=1472",
     ...},
    ...
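
The "multi" inputSpec combines the data already in the campaign's datasource with the newly enriched files of the last day. Below is a small Python helper that assembles just the fields visible on the slide; the function name and parameters are hypothetical, the fields elided on the slide are left out, and the result is therefore a partial fragment rather than a complete ingestion spec.

```python
import json


def build_input_spec(datasource: str, interval: str, new_data_path: str) -> dict:
    """Assemble the 'multi' inputSpec fragment: existing segments plus new files."""
    return {
        "type": "multi",
        "children": [
            # Re-read what is already stored in the datasource for the interval.
            {
                "type": "dataSource",
                "ingestionSpec": {"intervals": [interval], "dataSource": datasource},
            },
            # Add the newly enriched files for the last day.
            {"type": "static", "paths": new_data_path},
        ],
    }


input_spec = build_input_spec(
    datasource="campaign_1472",
    interval="2021-05-01/2021-05-27",
    new_data_path="s3://<BUCKET_NAME>/date=2021-05-26/campaign=1472",  # placeholder bucket
)
print(json.dumps(input_spec, indent=2))
```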

Funnel analysis pipeline - Druid datasources
{__time=2021-05-26, tactic=online_ad, stage=homepage, user_id_sketch=<Object>}
{__time=2021-05-26, tactic=online_ad, stage=productX_page, user_id_sketch=<Object>}
....
campaign_1210
campaign_1319
campaign_1472

Funnel analysis pipeline - querying Druid (SQL)
SELECT
  APPROX_COUNT_DISTINCT_DS_THETA(user_id_sketch, 65536) AS homepage_sketch
FROM campaign_1472
WHERE (("tactic" = 'online_ad')
  AND ("stage" = 'homepage'))
  AND __time BETWEEN '2021-05-01T00:00:00.000'
  AND '2021-05-26T23:59:59.000'
* This specific query returns the estimated number of unique users
that viewed the online ad AND viewed the homepage
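
Since each campaign has its own datasource, the same query shape can be generated per campaign. The helper below is a hypothetical sketch that only builds the SQL text and the JSON body one would POST to Druid's SQL endpoint (/druid/v2/sql); it does not send anything, and the inputs are assumed to be trusted values because they are interpolated directly into the statement.

```python
import json


def unique_users_query(campaign_id: int, tactic: str, stage: str,
                       start: str, end: str, sketch_size: int = 65536) -> str:
    """Build the count-distinct query for one tactic/stage pair of a campaign."""
    return (
        f"SELECT APPROX_COUNT_DISTINCT_DS_THETA(user_id_sketch, {sketch_size}) AS unique_users\n"
        f"FROM campaign_{campaign_id}\n"
        f"WHERE tactic = '{tactic}' AND stage = '{stage}'\n"
        f"  AND __time BETWEEN '{start}' AND '{end}'"
    )


sql = unique_users_query(
    campaign_id=1472,
    tactic="online_ad",
    stage="homepage",
    start="2021-05-01T00:00:00.000",
    end="2021-05-26T23:59:59.000",
)
payload = json.dumps({"query": sql})  # request body for the SQL endpoint
print(sql)
```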

Funnel analysis - simple use-case revisited
5/1/2021 - 5/26/2021
ONLINE AD      8.1M UUs
HOMEPAGE       3.1K UUs
Drop-off between HOMEPAGE and PRODUCT PAGE: 2.5K
PRODUCT PAGE   1K UUs
...
3,100 - 2,500 != 1,000


Funnel analysis - complex use-case
How many unique users viewed online ad?
VS
How many unique users
viewed online ad FIRST and
THEN viewed product X page?

Funnel analysis - complex use-case
● This is what we call a sequential funnel
○ Chronological order of events is important
● The data pipeline is very similar, but…
○ Taking into account only events that happened in the pre-defined order
of the funnel
● That way we better represent the efficiency of a specific tactic
(i.e. advertisement)
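
A possible way to express the sequential rule, in plain Python rather than Spark: a stage is attributed to a tactic only if the stage event happened after the user's first exposure to that tactic. This is a simplified assumption about what "pre-defined order" means (it does not check the order between stages), and `enrich_sequential` is an illustrative name.

```python
def enrich_sequential(mart_events):
    """Pair a stage with a tactic only if the stage happened after the exposure."""
    first_exposure = {}  # (userid, tactic) -> earliest exposure time
    for event in mart_events:
        if event["type"] == "Tactic":
            key = (event["userid"], event["attribute"])
            if key not in first_exposure or event["event_time"] < first_exposure[key]:
                first_exposure[key] = event["event_time"]

    enriched = []
    for event in mart_events:
        if event["type"] != "Stage":
            continue
        for (userid, tactic), exposed_at in sorted(first_exposure.items()):
            # Timestamps share one ISO-like format, so string comparison follows time order.
            if userid == event["userid"] and event["event_time"] > exposed_at:
                enriched.append({
                    "event_date": event["event_time"][:10],
                    "userid": userid,
                    "tactic": tactic,
                    "stage": event["attribute"],
                })
    return enriched


# A user who viewed the product page BEFORE being exposed to the ad.
mart_events = [
    {"event_time": "2021-05-26T09:15", "userid": "uid1", "attribute": "productX_page", "type": "Stage"},
    {"event_time": "2021-05-26T10:10", "userid": "uid1", "attribute": "online_ad", "type": "Tactic"},
    {"event_time": "2021-05-26T10:11", "userid": "uid1", "attribute": "homepage", "type": "Stage"},
]
for row in enrich_sequential(mart_events):
    print(row)
```

Under this rule the product page view at 09:15 is not credited to the ad shown at 10:10, while the homepage view at 10:11 is.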

Funnel analysis pipeline - reminder
Data Lake -> Mart Generator -> Campaigns' marts -> Enricher -> Enriched data -> Druid (load data by campaign)

Funnel analysis pipeline - Data Lake
{event_time=2021-05-26T09:15, userid=uid1, attribute=productX_page}
{event_time=2021-05-26T10:10, userid=uid1, attribute=online_ad}
{event_time=2021-05-26T10:11, userid=uid1, attribute=homepage}
....
date=2021-05-24
date=2021-05-25
date=2021-05-26

Funnel analysis pipeline - Mart Generator
campaign=1210
campaign=1319
campaign=1472
date=2021-05-24
date=2021-05-25
date=2021-05-26
{event_time=2021-05-26T09:15, userid=uid1, attribute=productX_page, type=Stage}
{event_time=2021-05-26T10:10, userid=uid1, attribute=online_ad, type=Tactic}
{event_time=2021-05-26T10:11, userid=uid1, attribute=homepage, type=Stage}
....

campaign=1210
campaign=1319
campaign=1472
date=2021-05-24
date=2021-05-25
date=2021-05-26
{event_date=2021-05-26, userid=uid1, tactic=online_ad, stage=productX_page}
....

campaign=1210
campaign=1319
campaign=1472
date=2021-05-24
date=2021-05-25
date=2021-05-26
....
....

Funnel analysis pipeline - querying Druid (SQL)
SELECT APPROX_COUNT_DISTINCT_DS_THETA(
         THETA_SKETCH_NOT(65536,
           THETA_SKETCH_INTERSECT(65536, a, b),
           THETA_SKETCH_UNION(65536, c, d, e))) AS dropoff_sketch
FROM (
  SELECT
    DS_THETA("user_id_sketch") FILTER (WHERE tactic = 'online_ad') AS a,
    DS_THETA("user_id_sketch") FILTER (WHERE stage = 'homepage') AS b,
    DS_THETA("user_id_sketch") FILTER (WHERE stage = 'productX_page') AS c,
    DS_THETA("user_id_sketch") FILTER (WHERE stage = 'add_to_cart') AS d,
    DS_THETA("user_id_sketch") FILTER (WHERE stage = 'checkout') AS e
  FROM campaign_1472
  WHERE stage IN ('homepage', 'productX_page', 'add_to_cart', 'checkout')
    AND tactic = 'online_ad'
    AND __time BETWEEN '2021-05-01T00:00:00.000' AND '2021-05-26T23:59:59.000'
) subquery
* This specific query should return the estimated number of unique
users for the drop-off between the homepage and product X page
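
The set algebra behind the drop-off query can be spelled out with exact Python sets standing in for the sketches a to e (an assumption for readability; Druid works on approximate Theta Sketches). The user IDs and the function name `homepage_drop_off` are made up.

```python
def homepage_drop_off(ad, homepage, product_page, add_to_cart, checkout):
    """(ad AND homepage) NOT (product_page OR add_to_cart OR checkout)."""
    reached_homepage = ad & homepage                     # THETA_SKETCH_INTERSECT(a, b)
    went_further = product_page | add_to_cart | checkout  # THETA_SKETCH_UNION(c, d, e)
    return reached_homepage - went_further               # THETA_SKETCH_NOT(...)


# Hypothetical users per sketch.
ad = {"u1", "u2", "u3", "u4"}
homepage = {"u1", "u2", "u3", "u9"}
product_page = {"u1"}
add_to_cart = {"u1"}
checkout = set()

dropped = homepage_drop_off(ad, homepage, product_page, add_to_cart, checkout)
print(sorted(dropped), len(dropped))
```

Users u2 and u3 saw the ad and the homepage but went no further, so they form the drop-off after the homepage.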

5/1/2021 - 5/26/2021
ONLINE AD      8.1M UUs
HOMEPAGE       3.1K UUs
Drop-off between HOMEPAGE and PRODUCT PAGE: 2.5K
PRODUCT PAGE   0.6K UUs
...
3,100 - 2,500 = 600


A few tips
● Use Druid with Theta Sketch for fast approximate count distinct
○ Allows set operations (intersection/union/negation)
● Use Spark to pre-process incoming events
○ Allows you to take into account only events that happened in the
pre-defined order of the funnel
○ Check out Etti’s “Optimizing Spark-based data pipelines” talk
(video - tinyurl.com/7hvyxtc8, slides - tinyurl.com/3rvc9mus)
● Optimize your ingestion process
○ Write Theta Sketch objects from Spark app
○ Load to Druid using isInputThetaSketch=true flag

What have we learned?
● Funnel analysis
○ Very important for advertisers
○ Not easy to solve technically (especially if chronological order of events matters)
● Druid is a very powerful tool for real-time analytics
○ Highly scalable, can ingest and store trillions of events, and serve analytic queries in
under a second
○ Used for many different use-cases
● Combining Apache Spark, Druid and DataSketches FTW!
○ Pre-process events before ingesting into Druid
○ Decide how to handle out-of-order events

Want to know more?
● Women in Big Data
○ A world-wide program that aims to inspire, connect, grow, and champion success of women in the Big Data & analytics field
○ 30+ chapters and 17,000+ members world-wide
○ Everyone can join (regardless of gender), so find a chapter near you -
www.womeninbigdata.org/membership/
● Conference talks
○ Casting the Spell: Druid in Practice (Berlin Buzzwords, June 17th 2021) - tinyurl.com/559hufnj
○ Migrating Airflow-based Spark Jobs to K8s (Data+AI Summit Europe 2020) - tinyurl.com/cbm42mn8
● Our Tech Blog - medium.com/nmc-techblog
○ Data Retention and Deletion in Apache Druid - tinyurl.com/yymrvrn2

THANK YOU
Etti Gur
Itai Yaffe

