REAL-TIME DATA PROCESSING AT RTB HOUSE
BIG DATA TECHNOLOGY MOSCOW 2018
OCTOBER 10-11, 2018
ARCHITECTURE & LESSONS LEARNED
BARTOSZ ŁOŚ
TABLE OF CONTENTS
Agenda:
- our RTB platform
- the first iteration: mutable structures
- the second iteration: data-flow
- the third iteration: immutable streams of events
- the fourth iteration: multi-DC architecture
- the current iteration: Kafka Workers
- summary
OUR RTB PLATFORM
OUR RTB PLATFORM: THE CONTEXT
Bid requests:
2M/s (peak)
~30 SSP networks
<50-100ms
User events:
1.5B tags/day
350M impressions/day
3.5M clicks/day
1.5M conversions/day
Other events:
bidlogs, accesslogs,
domain events etc.
OUR RTB PLATFORM: DATA PROCESSING NUMBERS
Kafka:
- up to 250K+ messages per second
- 50TB+ processed data every day
- 6 clusters in 4 datacenters
- 26 Kafka brokers
- 85 topics, 5000+ partitions
Docker (processing components only):
- 44 engines
- 1408 CPU cores, 5.5TB RAM
- 800+ containers
HDFS:
- 2PB+ data, up to 10GB/s
BigQuery:
- 1PB+ data, up to 10GB/min
Elasticsearch:
- 40TB data, up to 50K events/s
Aerospike (processing only):
- 80TB data, up to 8K events/s
THE FIRST ITERATION
THE 1ST ITERATION: MUTABLE IMPRESSIONS
THE 1ST ITERATION: DRAWBACKS
Issues:
- long, overloading data migrations (30 days back)
- complex servlet logic, inability to reprocess
- inflexible, various schemas
- single-DC
- inconsistencies
THE SECOND ITERATION:
DATA-FLOW
THE 2ND ITERATION: THE 1ST DATA-FLOW ARCHITECTURE
THE 2ND ITERATION: DISTRIBUTED LOG
Why Apache Kafka:
- distributed log
- topic partitioning
- partition replication
- log retention
- stateless
- efficient data consumption
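
The properties listed above can be illustrated with a toy in-memory model. This is only a sketch of the idea of a partitioned, append-only log with retention, not Kafka's implementation or API; the class name `PartitionedLog`, the CRC32 partitioner, and the count-based retention are assumptions made for the example.

```python
from collections import defaultdict
from zlib import crc32


class PartitionedLog:
    """Toy in-memory model of a partitioned, append-only log (not Kafka itself)."""

    def __init__(self, num_partitions, retention):
        self.num_partitions = num_partitions
        self.retention = retention            # max records kept per partition (>= 1)
        self.records = defaultdict(list)      # partition -> [(offset, value)]
        self.next_offset = defaultdict(int)   # partition -> next offset to assign

    def append(self, key, value):
        # The same key always maps to the same partition, preserving per-key order.
        partition = crc32(key.encode()) % self.num_partitions
        offset = self.next_offset[partition]
        self.next_offset[partition] += 1
        log = self.records[partition]
        log.append((offset, value))
        # Retention drops the oldest records; offsets are never reused.
        del log[:-self.retention]
        return partition, offset

    def read(self, partition, from_offset):
        # The log keeps no per-consumer state: the consumer supplies its own offset.
        return [(o, v) for o, v in self.records[partition] if o >= from_offset]
```

Because the reader passes its own offset, any number of consumers can re-read the same retained data independently, which is what makes reprocessing possible.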
THE 2ND ITERATION: BATCH LOADING
Why Camus:
- "Kafka to HDFS" pipeline
- batch tool
- MapReduce jobs
- storing offsets in log files
- data partitioning
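
A batch loader of this kind can be sketched as a function that resumes from stored offsets and buckets records into time-partitioned output paths. This is a simplified illustration, not Camus code: the function name `run_batch`, the record tuple layout, and the `events/YYYY/MM/DD/HH/part-N` path scheme are hypothetical.

```python
from datetime import datetime, timezone


def run_batch(partition_records, stored_offsets):
    """One batch run: read records past the stored offsets, bucket them by hour.

    partition_records: {partition: [(offset, epoch_seconds, payload), ...]}
    stored_offsets:    {partition: next offset to read}, as saved by the previous run
    Returns (files, new_offsets); files maps an output path to its payloads.
    """
    files = {}
    new_offsets = dict(stored_offsets)
    for partition, records in partition_records.items():
        start = stored_offsets.get(partition, 0)
        for offset, ts, payload in records:
            if offset < start:
                continue  # already loaded by an earlier batch
            hour = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y/%m/%d/%H")
            files.setdefault(f"events/{hour}/part-{partition}", []).append(payload)
            new_offsets[partition] = offset + 1
    return files, new_offsets
```

Persisting `new_offsets` together with the written files lets the next run continue where this one stopped without loading a record twice.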
THE 2ND ITERATION: AVRO & SCHEMA VERSIONING
Why Apache Avro:
- compact, efficient format
- schema: JSON format, payload: binary format
- self-describing container files
- rich data structures
- schema changes support, reader & writer schemas
Our approach:
- Avro for both Kafka messages and HDFS files
- schema registry
- avro-fastserde (github.com/RTBHOUSE/avro-fastserde)
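
The "reader & writer schemas" point can be shown with a deliberately simplified model of Avro schema resolution for record fields. It is not the Avro library and not avro-fastserde: the `resolve` function and the `Click` schemas are made up for the example, and it ignores type promotion, aliases, and nested records.

```python
# Hypothetical schema versions: v2 drops "referrer" and adds "campaign" with a default.
WRITER_V1 = {"type": "record", "name": "Click", "fields": [
    {"name": "time", "type": "long"},
    {"name": "impression_id", "type": "string"},
    {"name": "referrer", "type": "string"},
]}
READER_V2 = {"type": "record", "name": "Click", "fields": [
    {"name": "time", "type": "long"},
    {"name": "impression_id", "type": "string"},
    {"name": "campaign", "type": ["null", "string"], "default": None},
]}


def resolve(record, writer_schema, reader_schema):
    """Project a record written with one schema onto the reader's schema.

    Simplified: top-level fields only, matched by name.
    """
    writer_fields = {f["name"] for f in writer_schema["fields"]}
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in writer_fields:
            resolved[name] = record[name]        # present in both schemas
        elif "default" in field:
            resolved[name] = field["default"]    # reader-only field: use its default
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved  # fields known only to the writer are dropped
```

Under these rules, adding a field stays compatible as long as it carries a default, and removing a field is safe for readers that no longer declare it.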
THE 2ND ITERATION: ACCURATE STATISTICS
Why Apache Storm:
- real-time processing
- streams of tuples, topologies
- fault-tolerance
Why Trident:
- transactions, exactly-once processing
- microbatches (latency & throughput)
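
The combination of transactions and microbatches can be sketched as a counter that stores the id of the last applied batch next to each value, so a replayed batch is not counted twice. This is a minimal model of the idea, not Storm or Trident code; it assumes batches are applied in order and that a replayed batch keeps both its id and its contents. The name `TransactionalCounter` is hypothetical.

```python
class TransactionalCounter:
    """Counter that stores the id of the last applied batch next to each value."""

    def __init__(self):
        self.state = {}  # key -> (last_txid, count)

    def apply_batch(self, txid, batch):
        # Pre-aggregate the microbatch so each key is updated once per batch.
        partial = {}
        for key in batch:
            partial[key] = partial.get(key, 0) + 1
        for key, delta in partial.items():
            last_txid, count = self.state.get(key, (None, 0))
            if last_txid == txid:
                continue  # this batch was already applied to this key: skip the replay
            self.state[key] = (txid, count + delta)
```

Updating once per batch rather than once per tuple is also where the latency and throughput trade-off of microbatching comes from.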
THE 2ND ITERATION: STATS-COUNTER TOPOLOGY
THE 2ND ITERATION: DRAWBACKS
Hybrid architecture:
- aggregates (real-time)
- raw events (2-hour batches)
- joined events (end-of-day batch jobs)
Other issues:
- Hive joins
- mutable events
- complex servlet logic
THE THIRD ITERATION:
NEW APPROACH
THE 3RD ITERATION: NEW APPROACH
{ "IMPRESSION":
  "URL",
  "TIME",
  "CREATIVE",
  ...
  "CLICKS",
  "CONVERSIONS"
}
{ "CLICK":
  "TIME",
  "IMPRESSION_ID",
  ...
  "IMPRESSION"
}
{ "CONVERSION":
  "TIME",
  "CLICK_ID",
  ...
  "IMPRESSION",
  "CLICK"
}
New approach:
- real-time processing
- publishing light events
- immutable streams of events
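
One way to read the structures above: a light click event arrives carrying only an impression id, and processing emits a new event with the impression embedded instead of updating a stored record. The sketch below assumes that reading; the function `enrich_click`, the lowercase field names, and the in-memory lookup table are hypothetical.

```python
from types import MappingProxyType


def enrich_click(click, impressions_by_id):
    """Return a new read-only click event with its impression embedded.

    Returns None when the impression has not been seen (yet).
    """
    impression = impressions_by_id.get(click["impression_id"])
    if impression is None:
        return None
    # Build a new event instead of mutating the click or the stored impression.
    return MappingProxyType({**click, "impression": impression})
```

A conversion could be handled the same way, looking up its click by `click_id` and carrying both the click and the impression forward.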
THE 3RD ITERATION: HIGH-LEVEL ARCHITECTURE
THE 3RD ITERATION: DATA-FLOW TOPOLOGY
THE FOURTH ITERATION:
MULTI-DC
THE 4TH ITERATION: NEW REQUIREMENTS
Main changes:
- 5-6x larger scale:
> from 350K to 2M bid requests/s within 1.5 years
- full multi-DC architecture:
> merging streams of events
> synchronization of user profiles
- end-to-end exactly-once processing:
> at-least-once output semantics + deduplication
- several improved components:
> merger
> new stats-counter, new data-flow
> dispatcher & loader
> logstash
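
The "at-least-once output semantics + deduplication" item can be sketched as a filter that remembers recently seen event ids and drops repeats. This is an illustration under assumptions, not the component used in the platform: it presumes every event carries a unique id, and the `Deduplicator` class with its bounded window is hypothetical.

```python
from collections import OrderedDict


class Deduplicator:
    """Drops events whose id was seen among the last `capacity` distinct ids."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.seen = OrderedDict()  # event id -> None, oldest first

    def accept(self, event_id):
        if event_id in self.seen:
            return False  # duplicate produced by an at-least-once retry
        self.seen[event_id] = None
        if len(self.seen) > self.capacity:
            self.seen.popitem(last=False)  # forget the oldest id
        return True
```

The window bounds memory use, but a duplicate that arrives after its id has been evicted passes through, so the capacity has to cover the longest expected retry delay.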
THE 4TH ITERATION: MULTI-DC ARCHITECTURE
THE 4TH ITERATION: NEW DATA-FLOW ON KAFKA STREAMS
(picture from kafka.apache.org)
Why Kafka Streams:
- fully embedded library with no stream processing cluster
- no external dependencies
- Kafka's parallelism model and group membership mechanism
- event-at-a-time processing (not microbatch)
- exactly-once processing semantics (but at-least-once was good enough)
THE 4TH ITERATION: MERGER ON KAFKA CONSUMER API
THE CURRENT ITERATION:
KAFKA WORKERS
THE 5TH ITERATION: KAFKA WORKERS
Main features:
- higher level of distribution
- ability to pause and resume processing for a given partition
- asynchronous processing
- tighter control of offset commits
- backpressure
- at-least-once semantics
- processing timeouts
- handling failures
- multiple consumers (in progress)
- Kafka-to-Kafka, HDFS, BigQuery, Elasticsearch connectors (in progress)
(github.com/RTBHOUSE/kafka-workers)
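
Asynchronous processing with at-least-once semantics means records can finish out of order, so only the end of the contiguous processed prefix is safe to commit. The sketch below illustrates that bookkeeping for one partition. It is not the kafka-workers API; the `OffsetTracker` class and its method names are hypothetical.

```python
class OffsetTracker:
    """Tracks out-of-order completions for one partition."""

    def __init__(self, first_offset=0):
        self.committable = first_offset  # next offset to commit (all below are done)
        self.done = set()                # processed offsets above the contiguous prefix

    def mark_processed(self, offset):
        if offset < self.committable:
            return  # already covered by the commit point
        self.done.add(offset)
        # Advance the commit point over every offset that is now contiguous.
        while self.committable in self.done:
            self.done.remove(self.committable)
            self.committable += 1

    def in_flight_gap(self):
        # How far completed work has run ahead of the commit point; a caller
        # could pause the partition when this grows too large (backpressure).
        return max(self.done, default=self.committable - 1) + 1 - self.committable
```

After a crash, processing restarts from `committable`, so records above it that had already finished are processed again, which is exactly the at-least-once behaviour.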
THE 5TH ITERATION: KAFKA WORKERS ARCHITECTURE
SUMMARY
What we have achieved:
- platform monitoring
- much more stable platform
- higher quality of data processing
- HDFS & BigQuery & Elasticsearch streaming
- multi-DC architecture and data synchronization
- high scalability
- better data-flow monitoring, deployment & maintenance
THANK YOU FOR YOUR
ATTENTION
Real-Time Data Processing at RTB House – Architecture & Lessons Learned

  • 1. REAL-TIME DATA PROCESSING AT RTB HOUSE: ARCHITECTURE & LESSONS LEARNED. BARTOSZ ŁOŚ. BIG DATA TECHNOLOGY MOSCOW 2018, OCTOBER 10-11, 2018
  • 2. TABLE OF CONTENTS Agenda: - our rtb platform - the first iteration: mutable structures - the second iteration: data-flow - the third iteration: immutable streams of events - the fourth iteration: multi-dc architecture - the current iteration: kafka workers - summary
  • 4. OUR RTB PLATFORM: THE CONTEXT Bid requests: - 2M/s (peak) - ~30 SSP networks - <50-100ms User events: - 1.5B tags/day - 350M impressions/day - 3.5M clicks/day - 1.5M conversions/day Other events: - bidlogs, accesslogs, domain events etc.
  • 5. OUR RTB PLATFORM: DATA PROCESSING NUMBERS Kafka: - up to 250K+ messages per second - 50TB+ processed data every day - 6 clusters in 4 datacenters - 26 Kafka brokers - 85 topics, 5000+ partitions Docker (processing components only): - 44 engines - 1408 cpu cores, 5.5TB ram - 800+ containers HDFS: - 2PB+ data, up to 10GB/s BigQuery: - 1PB+ data, up to 10GB/min Elasticsearch: - 40TB data, up to 50K events/s Aerospike (processing only): - 80TB data, up to 8K events/s
  • 7. THE 1ST ITERATION: MUTABLE IMPRESSIONS
  • 8. THE 1ST ITERATION: DRAWBACKS Issues: - long, overloading data migrations (30 days back) - complex servlets' logic, inability to reprocess - inflexible, various schemas - single-DC - inconsistencies
  • 10. THE 2ND ITERATION: THE 1ST DATA-FLOW ARCHITECTURE
  • 11. THE 2ND ITERATION: DISTRIBUTED LOG Why Apache Kafka: - distributed log - topics partitioning - partition replication - log retention - stateless - efficient data consuming
  • 12. THE 2ND ITERATION: BATCH LOADING Why Apache Camus: - "Kafka to HDFS" pipeline - batch tool - map-reduce jobs - storing offsets in log files - data partitioning
  • 13. THE 2ND ITERATION: AVRO & SCHEMA VERSIONING Why Apache Avro: - compact, efficient format - schema: JSON format, payload: binary format - self-describing container files - rich data structures - schema changes support, reader & writer schemas Our approach: - Kafka's messages and HDFS files - schema registry - avro-fastserde (github.com/RTBHOUSE/avro-fastserde)
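Slide 13 mentions reader and writer schemas. A minimal sketch of the core resolution rule from the Avro specification follows, in plain Python rather than with an Avro library: fields the reader does not know are dropped, and fields the writer did not produce take the reader's default. The function name `resolve_record` and the example field names are hypothetical and not taken from the talk, and real Avro resolution also covers type promotion, aliases, unions and nested records.

```python
def resolve_record(record, writer_field_names, reader_fields):
    """Project a record decoded with the writer schema onto the reader schema.

    record: dict decoded using the writer schema
    writer_field_names: set of field names present in the writer schema
    reader_fields: list of dicts like {"name": ..., "default": ...};
                   the "default" key is optional
    """
    resolved = {}
    for field in reader_fields:
        name = field["name"]
        if name in writer_field_names:
            # Field exists in both schemas: keep the written value.
            resolved[name] = record[name]
        elif "default" in field:
            # Field added in the reader schema: fall back to its default.
            resolved[name] = field["default"]
        else:
            # No written value and no default: the schemas are incompatible.
            raise ValueError(f"no value or default for field {name!r}")
    # Writer-only fields are never copied, so they are dropped implicitly.
    return resolved


# Hypothetical example: an old writer schema read with a newer reader schema.
old_record = {"id": 1, "url": "example.org"}
new_reader_fields = [
    {"name": "id"},
    {"name": "creative", "default": None},
]
upgraded = resolve_record(old_record, {"id", "url"}, new_reader_fields)
```

Here `upgraded` keeps `id`, gains `creative` with its default, and no longer carries `url`.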
  • 14. THE 2ND ITERATION: ACCURATE STATISTICS Why Apache Storm: - real-time processing - streams of tuples, topologies - fault-tolerance Why Trident: - transactions, exactly-once processing - microbatches (latency & throughput)
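Slide 14 lists transactions and exactly-once processing over microbatches as the reasons for Trident. The idea behind transactional state of this kind can be illustrated without Storm: each counter stores the id of the last batch applied to it, so a replayed batch is recognised and skipped. This is a simplified sketch, not Trident's API; the class `TransactionalCounter` and the key names are hypothetical, and it assumes a replayed batch carries the same id and the same contents as the original.

```python
class TransactionalCounter:
    """Counters that tolerate replayed microbatches."""

    def __init__(self):
        # key -> (id of the last batch applied, current count)
        self._state = {}

    def apply_batch(self, txid, increments):
        """Apply per-key increments of one microbatch at most once."""
        for key, delta in increments.items():
            last_txid, count = self._state.get(key, (None, 0))
            if last_txid == txid:
                # This batch was already applied to this key: skip the replay.
                continue
            self._state[key] = (txid, count + delta)

    def get(self, key):
        return self._state.get(key, (None, 0))[1]


counter = TransactionalCounter()
counter.apply_batch(1, {"clicks": 3, "impressions": 10})
counter.apply_batch(1, {"clicks": 3, "impressions": 10})  # replay after a failure
counter.apply_batch(2, {"clicks": 2})
```

After the replay, `clicks` is 5 rather than 8, because batch 1 was counted only once.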
  • 15. THE 2ND ITERATION: STATS-COUNTER TOPOLOGY
  • 16. THE 2ND ITERATION: DRAWBACKS Hybrid architecture: - aggregates (real-time) - raw events (2-hour batches) - joined events (end-of-day batch jobs) Other issues: - Hive joins - mutable events - servlets' complex logic
  • 18. THE 3RD ITERATION: NEW APPROACH { "IMPRESSION": "URL", "TIME", "CREATIVE", ... "CLICKS", "CONVERSIONS" } { "CLICK": "TIME", "IMPRESSION_ID", ... "IMPRESSION" } { "CONVERSION": "TIME", "CLICK_ID", ... "IMPRESSION", "CLICK" } New approach: - real-time processing - publishing light events - immutable streams of events
  • 19. THE 3RD ITERATION: HIGH-LEVEL ARCHITECTURE
  • 20. THE 3RD ITERATION: DATA-FLOW TOPOLOGY
  • 22. THE 4TH ITERATION: NEW REQUIREMENTS Main changes: - 5-6x larger scale (from 350K to 2M bid requests/s within 1.5 years) - full multi-dc architecture (merging streams of events, synchronization of user profiles) - end-to-end exactly-once processing (at-least-once output semantics + deduplication) - a few better components (merger; new stats-counter, new data-flow; dispatcher & loader; logstash)
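Slide 22 describes end-to-end exactly-once processing as at-least-once output plus deduplication. A minimal sketch of the deduplication half is shown below, assuming every event carries a unique id and that duplicates arrive within a bounded window. The class `Deduplicator`, the window size and the event ids are hypothetical and not taken from the talk.

```python
from collections import OrderedDict


class Deduplicator:
    """Remembers the most recent event ids and rejects repeats."""

    def __init__(self, max_ids):
        self._seen = OrderedDict()  # event id -> True, oldest first
        self._max_ids = max_ids

    def is_new(self, event_id):
        if event_id in self._seen:
            return False  # duplicate delivery, drop it
        self._seen[event_id] = True
        if len(self._seen) > self._max_ids:
            # Forget the oldest id to keep memory bounded.
            self._seen.popitem(last=False)
        return True


# An at-least-once source may deliver the same event more than once.
delivered = ["a", "b", "a", "c", "b"]
dedup = Deduplicator(max_ids=1000)
unique = [event_id for event_id in delivered if dedup.is_new(event_id)]
```

The bounded window is a trade-off: a duplicate that arrives after its id has been evicted would pass through, so the window has to cover the longest expected redelivery delay.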
  • 23. THE 4TH ITERATION: MULTI-DC ARCHITECTURE
  • 24. THE 4TH ITERATION: NEW DATA-FLOW ON KAFKA STREAMS (picture from kafka.apache.org) Why Kafka Streams: - fully embedded library with no stream processing cluster - no external dependencies - Kafka's parallelism model and group membership mechanism - event-at-a-time processing (not microbatch) - exactly-once processing semantics (but at-least-once was good enough)
  • 25. THE 4TH ITERATION: MERGER ON KAFKA CONSUMER API
  • 27. THE 5TH ITERATION: KAFKA WORKERS Main features: - higher level of distribution - possibility to pause and resume processing for given partition - asynchronous processing - tighter control of offsets commits - backpressure - at-least-once semantics - processing timeouts - handling failures - multiple consumers (in progress) - kafka-to-kafka, hdfs, bigquery, elasticsearch connectors (in progress) (github.com/RTBHOUSE/kafka-workers)
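Slide 27 combines asynchronous processing with tighter control of offset commits and at-least-once semantics. One common way to reconcile these, sketched below, is to commit only up to the first offset that has not finished processing, even if later records completed earlier. This is an illustration of the general technique, not the kafka-workers API; the class `OffsetTracker` and its method names are hypothetical. It follows Kafka's convention that the committed offset is the next offset to be consumed.

```python
class OffsetTracker:
    """Tracks out-of-order completions for a single partition."""

    def __init__(self, first_offset):
        self._next_to_commit = first_offset  # lowest offset not yet finished
        self._done = set()                   # finished offsets above that point

    def mark_done(self, offset):
        """Record that asynchronous processing of this offset has finished."""
        self._done.add(offset)

    def committable(self):
        """Return the offset that is safe to commit right now."""
        # Advance over the contiguous run of finished offsets.
        while self._next_to_commit in self._done:
            self._done.remove(self._next_to_commit)
            self._next_to_commit += 1
        return self._next_to_commit


tracker = OffsetTracker(first_offset=0)
tracker.mark_done(1)
tracker.mark_done(2)
before = tracker.committable()  # offset 0 is still in flight
tracker.mark_done(0)
after = tracker.committable()   # offsets 0, 1 and 2 are all finished
```

If the process crashes while offset 0 is still in flight, records 1 and 2 are processed again after restart, which is the at-least-once behaviour named on the slide.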
  • 28. THE 5TH ITERATION: KAFKA WORKERS ARCHITECTURE
  • 29. SUMMARY What we have achieved: - platform monitoring - much more stable platform - higher quality of data processing - HDFS & BigQuery & Elasticsearch streaming - multi-DC architecture and data synchronization - high scalability - better data-flow monitoring, deployment & maintenance
  • 30. THANK YOU FOR YOUR ATTENTION