SlideShare a Scribd company logo
Data Science Infrastructure Team, Thanh Tran
Upwork
How to Rebuild Data and ML
Platform using Kinesis, S3,
Spark, MLlib, Databricks,
Airflow and Upwork
#AssignedHashtagGoesHere
Introduction
2
3
WE
I
OUR
Nikolay Melnik
Lead ML Engineer
Ukraine
Dimitris Manikis
Senior Data Engineer
Greece
Artem Moskvin
Data/ML Engineer
Germany
Roman Tkachuk
Senior Data Engineer
Ukraine
Andrei Demus
Data/ML Engineer
Ukraine
Igor Korsunov
ML Engineer
Russia
Anna Lysak
Data/ML Engineer
Ukraine
Yongtao Ma
Senior ML Engineer
Germany
Giannis Koutsoubos
Lead Backend Engineer
Greece
4
● Highest-skilled experts for
the job
● Competitive/lower rate
● Mix of long-term and
project-based staff
QUALITY
COST/EARNING
AGILITY
● Work on cutting-edge projects
● Happy with competitive
compensation + flexibility in
location and work hours
● Work only when they want work
My Team
With Upwork, our new hires AND I are better off!
Me
5
We believe significant
welfare improvements
can be achieved through
data science driven
optimization of the online
labor marketplace.
6
We have the biggest
closed-loop online
dataset of jobs and job
seekers in labor
history.
Contract progress (~1B)
Feedback (~10M)
Web site activity (~10B)
Money transactions (~100M)
Profiles (~10M)
Job Posts (~10M)
Proposals (~100M)
Messages (~100M)
Hiring decisions (~10M)
Contract progress (~1B)
Feedback (~10M)
Web site activity (~10B)
Money transactions (~100M)
7
What do we need to ship data
science products?
8
We need to support an agile data science workflow
to provide quick and validated improvements!
● Data Science analytics
○ Complete and cleansed data, single ground-truth
○ Tools for computing metrics, continuous validation
● Data Science model development
○ Business objects and UI event data
○ Scaling complex data processing and feature computation
○ Discoverability of data and features
○ Batch + live data mismatches
○ Managing, monitoring and versioning of models and experiments
○ Knowledge sharing and code reuse (experiments, model, feature computation pipeline)
○ Flexibility to accommodate variety of ML frameworks
● Data Science model productionalization
○ Minimize differences between trained model and production code
○ Code modularized, tested, integrated into CI/CD workflow
○ Standardized model serving that is scalable, available, high throughput, low latency...
Upwork Data & ML Platform
9
10
11
● Kinesis and Structure Spark Streaming for high throughput live event data processing
● Moving away from traditional DWH solution to distributed Spark-based batch data
processing to avoid performance issues and workload limitations
● Spark MLlib + Tensorflow as core ML libraries to balance the tradeoff between
flexibility and standardized model engineering
● Data processing, feature computation and pipeline retraining jobs scheduled and
orchestrated via Airflow
● Experiment management and model versioning integral part of CI/CD workflow
● Adopt engineer CI/CD workflow to data science using Jenkins, Databricks and Airflow:
standalone model testing + live regression test helps to identify batch and live data
mismatches
● Spark-based pipeline developed by data scientists directly used for model scoring in
production environment
● Microservices for streamlined model serving, scalability, availability...
● Extensive use of Databricks notebook-based documentation of model, experiments
and feature engineering code
● Graphite, ELK and Pagerduty for logging, monitoring and alerts
Batch Data & ML Environment
12
13
14
Live Data & ML Pipeline
15
16
17
CI/CD Workflow
18
19
20
Pitfalls and Lessons Learned
21
• Microservices can lead to data fragmentation and high downstream
processing overhead
• Structured streaming latency when number of Kinesis consumers is high
• Stream-to-stream/stream-to-batch join not suitable for real-time use cases
yet
• Differences between live and batch data
• Differences between trained vs. deployed ML pipeline can be minimized
• CI/CD needs to be customized to support data science workflow and
artefacts
• Databricks notebooks very convenient for collaboration, documentation,
code sharing and reuse, results dissemination
22
23
Interested in search & recommendations, multi-sided matching or
online labor marketplace optimization? We are hiring!
Interested in doing work only when you want work?
Join Upwork as contractors!
Ad

More Related Content

What's hot (20)

Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data Architecture
Databricks
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse Technology
Matei Zaharia
 
Learn to Use Databricks for the Full ML Lifecycle
Learn to Use Databricks for the Full ML LifecycleLearn to Use Databricks for the Full ML Lifecycle
Learn to Use Databricks for the Full ML Lifecycle
Databricks
 
Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020
Julien Le Dem
 
Data saturday Oslo Azure Purview Erwin de Kreuk
Data saturday Oslo Azure Purview Erwin de KreukData saturday Oslo Azure Purview Erwin de Kreuk
Data saturday Oslo Azure Purview Erwin de Kreuk
Erwin de Kreuk
 
Building a modern data warehouse
Building a modern data warehouseBuilding a modern data warehouse
Building a modern data warehouse
James Serra
 
Introduction to Azure Databricks
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure Databricks
James Serra
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
Databricks
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
Databricks
 
Webinar: Decoding the Mystery - How to Know if You Need a Data Catalog, a Dat...
Webinar: Decoding the Mystery - How to Know if You Need a Data Catalog, a Dat...Webinar: Decoding the Mystery - How to Know if You Need a Data Catalog, a Dat...
Webinar: Decoding the Mystery - How to Know if You Need a Data Catalog, a Dat...
DATAVERSITY
 
Unified MLOps: Feature Stores & Model Deployment
Unified MLOps: Feature Stores & Model DeploymentUnified MLOps: Feature Stores & Model Deployment
Unified MLOps: Feature Stores & Model Deployment
Databricks
 
Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0
Databricks
 
Data platform modernization with Databricks.pptx
Data platform modernization with Databricks.pptxData platform modernization with Databricks.pptx
Data platform modernization with Databricks.pptx
CalvinSim10
 
Azure Databricks - An Introduction (by Kris Bock)
Azure Databricks - An Introduction (by Kris Bock)Azure Databricks - An Introduction (by Kris Bock)
Azure Databricks - An Introduction (by Kris Bock)
Daniel Toomey
 
Data Architecture Best Practices for Advanced Analytics
Data Architecture Best Practices for Advanced AnalyticsData Architecture Best Practices for Advanced Analytics
Data Architecture Best Practices for Advanced Analytics
DATAVERSITY
 
You Need a Data Catalog. Do You Know Why?
You Need a Data Catalog. Do You Know Why?You Need a Data Catalog. Do You Know Why?
You Need a Data Catalog. Do You Know Why?
Precisely
 
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Simplilearn
 
Getting Started with Delta Lake on Databricks
Getting Started with Delta Lake on DatabricksGetting Started with Delta Lake on Databricks
Getting Started with Delta Lake on Databricks
Knoldus Inc.
 
Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to Mesh
Jeffrey T. Pollock
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
Niko Vuokko
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data Architecture
Databricks
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse Technology
Matei Zaharia
 
Learn to Use Databricks for the Full ML Lifecycle
Learn to Use Databricks for the Full ML LifecycleLearn to Use Databricks for the Full ML Lifecycle
Learn to Use Databricks for the Full ML Lifecycle
Databricks
 
Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020
Julien Le Dem
 
Data saturday Oslo Azure Purview Erwin de Kreuk
Data saturday Oslo Azure Purview Erwin de KreukData saturday Oslo Azure Purview Erwin de Kreuk
Data saturday Oslo Azure Purview Erwin de Kreuk
Erwin de Kreuk
 
Building a modern data warehouse
Building a modern data warehouseBuilding a modern data warehouse
Building a modern data warehouse
James Serra
 
Introduction to Azure Databricks
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure Databricks
James Serra
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
Databricks
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
Databricks
 
Webinar: Decoding the Mystery - How to Know if You Need a Data Catalog, a Dat...
Webinar: Decoding the Mystery - How to Know if You Need a Data Catalog, a Dat...Webinar: Decoding the Mystery - How to Know if You Need a Data Catalog, a Dat...
Webinar: Decoding the Mystery - How to Know if You Need a Data Catalog, a Dat...
DATAVERSITY
 
Unified MLOps: Feature Stores & Model Deployment
Unified MLOps: Feature Stores & Model DeploymentUnified MLOps: Feature Stores & Model Deployment
Unified MLOps: Feature Stores & Model Deployment
Databricks
 
Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0
Databricks
 
Data platform modernization with Databricks.pptx
Data platform modernization with Databricks.pptxData platform modernization with Databricks.pptx
Data platform modernization with Databricks.pptx
CalvinSim10
 
Azure Databricks - An Introduction (by Kris Bock)
Azure Databricks - An Introduction (by Kris Bock)Azure Databricks - An Introduction (by Kris Bock)
Azure Databricks - An Introduction (by Kris Bock)
Daniel Toomey
 
Data Architecture Best Practices for Advanced Analytics
Data Architecture Best Practices for Advanced AnalyticsData Architecture Best Practices for Advanced Analytics
Data Architecture Best Practices for Advanced Analytics
DATAVERSITY
 
You Need a Data Catalog. Do You Know Why?
You Need a Data Catalog. Do You Know Why?You Need a Data Catalog. Do You Know Why?
You Need a Data Catalog. Do You Know Why?
Precisely
 
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Simplilearn
 
Getting Started with Delta Lake on Databricks
Getting Started with Delta Lake on DatabricksGetting Started with Delta Lake on Databricks
Getting Started with Delta Lake on Databricks
Knoldus Inc.
 
Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to Mesh
Jeffrey T. Pollock
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
Niko Vuokko
 

Similar to How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Thanh Tran (20)

Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...
Ahsan Javed Awan
 
Reducing Cost of Production ML: Feature Engineering Case Study
Reducing Cost of Production ML: Feature Engineering Case StudyReducing Cost of Production ML: Feature Engineering Case Study
Reducing Cost of Production ML: Feature Engineering Case Study
Venkata Pingali
 
Real time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.jsReal time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.js
Ben Laird
 
Ai platform at scale
Ai platform at scaleAi platform at scale
Ai platform at scale
Henry Saputra
 
Simply Business' Data Platform
Simply Business' Data PlatformSimply Business' Data Platform
Simply Business' Data Platform
Dani Solà Lagares
 
Lambda architecture with Spark
Lambda architecture with SparkLambda architecture with Spark
Lambda architecture with Spark
Vincent GALOPIN
 
Sl boston 05_12_15_ener_noc_final_public
Sl boston 05_12_15_ener_noc_final_publicSl boston 05_12_15_ener_noc_final_public
Sl boston 05_12_15_ener_noc_final_public
Splunk
 
4070949. 89-Test-12-File.pdf
4070949.             89-Test-12-File.pdf4070949.             89-Test-12-File.pdf
4070949. 89-Test-12-File.pdf
raypoll198
 
Graph Hardware Architecture - Enterprise graphs deserve great hardware!
Graph Hardware Architecture - Enterprise graphs deserve great hardware!Graph Hardware Architecture - Enterprise graphs deserve great hardware!
Graph Hardware Architecture - Enterprise graphs deserve great hardware!
TigerGraph
 
Ajith_kumar_4.3 Years_Informatica_ETL
Ajith_kumar_4.3 Years_Informatica_ETLAjith_kumar_4.3 Years_Informatica_ETL
Ajith_kumar_4.3 Years_Informatica_ETL
Ajith Kumar Pampatti
 
SharePoint Best Practices Conference 2013
SharePoint Best Practices Conference 2013SharePoint Best Practices Conference 2013
SharePoint Best Practices Conference 2013
Mike Brannon
 
Enabling Data centric Teams
Enabling Data centric TeamsEnabling Data centric Teams
Enabling Data centric Teams
Data Con LA
 
GCP Online Training | GCP Data Engineer Online Course
GCP Online Training | GCP Data Engineer Online CourseGCP Online Training | GCP Data Engineer Online Course
GCP Online Training | GCP Data Engineer Online Course
Jayanthvisualpath
 
EISmartwork Plant Digitization toward Industrial 4.0
EISmartwork Plant Digitization toward Industrial 4.0EISmartwork Plant Digitization toward Industrial 4.0
EISmartwork Plant Digitization toward Industrial 4.0
Lee Kian Lie
 
Python + MPP Database = Large Scale AI/ML Projects in Production Faster
Python + MPP Database = Large Scale AI/ML Projects in Production FasterPython + MPP Database = Large Scale AI/ML Projects in Production Faster
Python + MPP Database = Large Scale AI/ML Projects in Production Faster
Paige_Roberts
 
Transforms Document Management at Scale with Distributed Database Solution wi...
Transforms Document Management at Scale with Distributed Database Solution wi...Transforms Document Management at Scale with Distributed Database Solution wi...
Transforms Document Management at Scale with Distributed Database Solution wi...
DataStax Academy
 
Serhii Kholodniuk: What you need to know, before migrating data platform to G...
Serhii Kholodniuk: What you need to know, before migrating data platform to G...Serhii Kholodniuk: What you need to know, before migrating data platform to G...
Serhii Kholodniuk: What you need to know, before migrating data platform to G...
Lviv Startup Club
 
Data lineage
Data lineageData lineage
Data lineage
GirishLingappa
 
Data Engineer Course In Bangalore-October
Data Engineer Course In Bangalore-OctoberData Engineer Course In Bangalore-October
Data Engineer Course In Bangalore-October
DataMites
 
Big Data Processing Beyond MapReduce by Dr. Flavio Villanustre
Big Data Processing Beyond MapReduce by Dr. Flavio VillanustreBig Data Processing Beyond MapReduce by Dr. Flavio Villanustre
Big Data Processing Beyond MapReduce by Dr. Flavio Villanustre
HPCC Systems
 
Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...
Ahsan Javed Awan
 
Reducing Cost of Production ML: Feature Engineering Case Study
Reducing Cost of Production ML: Feature Engineering Case StudyReducing Cost of Production ML: Feature Engineering Case Study
Reducing Cost of Production ML: Feature Engineering Case Study
Venkata Pingali
 
Real time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.jsReal time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.js
Ben Laird
 
Ai platform at scale
Ai platform at scaleAi platform at scale
Ai platform at scale
Henry Saputra
 
Simply Business' Data Platform
Simply Business' Data PlatformSimply Business' Data Platform
Simply Business' Data Platform
Dani Solà Lagares
 
Lambda architecture with Spark
Lambda architecture with SparkLambda architecture with Spark
Lambda architecture with Spark
Vincent GALOPIN
 
Sl boston 05_12_15_ener_noc_final_public
Sl boston 05_12_15_ener_noc_final_publicSl boston 05_12_15_ener_noc_final_public
Sl boston 05_12_15_ener_noc_final_public
Splunk
 
4070949. 89-Test-12-File.pdf
4070949.             89-Test-12-File.pdf4070949.             89-Test-12-File.pdf
4070949. 89-Test-12-File.pdf
raypoll198
 
Graph Hardware Architecture - Enterprise graphs deserve great hardware!
Graph Hardware Architecture - Enterprise graphs deserve great hardware!Graph Hardware Architecture - Enterprise graphs deserve great hardware!
Graph Hardware Architecture - Enterprise graphs deserve great hardware!
TigerGraph
 
Ajith_kumar_4.3 Years_Informatica_ETL
Ajith_kumar_4.3 Years_Informatica_ETLAjith_kumar_4.3 Years_Informatica_ETL
Ajith_kumar_4.3 Years_Informatica_ETL
Ajith Kumar Pampatti
 
SharePoint Best Practices Conference 2013
SharePoint Best Practices Conference 2013SharePoint Best Practices Conference 2013
SharePoint Best Practices Conference 2013
Mike Brannon
 
Enabling Data centric Teams
Enabling Data centric TeamsEnabling Data centric Teams
Enabling Data centric Teams
Data Con LA
 
GCP Online Training | GCP Data Engineer Online Course
GCP Online Training | GCP Data Engineer Online CourseGCP Online Training | GCP Data Engineer Online Course
GCP Online Training | GCP Data Engineer Online Course
Jayanthvisualpath
 
EISmartwork Plant Digitization toward Industrial 4.0
EISmartwork Plant Digitization toward Industrial 4.0EISmartwork Plant Digitization toward Industrial 4.0
EISmartwork Plant Digitization toward Industrial 4.0
Lee Kian Lie
 
Python + MPP Database = Large Scale AI/ML Projects in Production Faster
Python + MPP Database = Large Scale AI/ML Projects in Production FasterPython + MPP Database = Large Scale AI/ML Projects in Production Faster
Python + MPP Database = Large Scale AI/ML Projects in Production Faster
Paige_Roberts
 
Transforms Document Management at Scale with Distributed Database Solution wi...
Transforms Document Management at Scale with Distributed Database Solution wi...Transforms Document Management at Scale with Distributed Database Solution wi...
Transforms Document Management at Scale with Distributed Database Solution wi...
DataStax Academy
 
Serhii Kholodniuk: What you need to know, before migrating data platform to G...
Serhii Kholodniuk: What you need to know, before migrating data platform to G...Serhii Kholodniuk: What you need to know, before migrating data platform to G...
Serhii Kholodniuk: What you need to know, before migrating data platform to G...
Lviv Startup Club
 
Data Engineer Course In Bangalore-October
Data Engineer Course In Bangalore-OctoberData Engineer Course In Bangalore-October
Data Engineer Course In Bangalore-October
DataMites
 
Big Data Processing Beyond MapReduce by Dr. Flavio Villanustre
Big Data Processing Beyond MapReduce by Dr. Flavio VillanustreBig Data Processing Beyond MapReduce by Dr. Flavio Villanustre
Big Data Processing Beyond MapReduce by Dr. Flavio Villanustre
HPCC Systems
 
Ad

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
Ad

Recently uploaded (20)

RAG Chatbot using AWS Bedrock and Streamlit Framework
RAG Chatbot using AWS Bedrock and Streamlit FrameworkRAG Chatbot using AWS Bedrock and Streamlit Framework
RAG Chatbot using AWS Bedrock and Streamlit Framework
apanneer
 
Time series for yotube_1_data anlysis.pdf
Time series for yotube_1_data anlysis.pdfTime series for yotube_1_data anlysis.pdf
Time series for yotube_1_data anlysis.pdf
asmaamahmoudsaeed
 
Process Mining at Deutsche Bank - Journey
Process Mining at Deutsche Bank - JourneyProcess Mining at Deutsche Bank - Journey
Process Mining at Deutsche Bank - Journey
Process mining Evangelist
 
Process Mining as Enabler for Digital Transformations
Process Mining as Enabler for Digital TransformationsProcess Mining as Enabler for Digital Transformations
Process Mining as Enabler for Digital Transformations
Process mining Evangelist
 
hersh's midterm project.pdf music retail and distribution
hersh's midterm project.pdf music retail and distributionhersh's midterm project.pdf music retail and distribution
hersh's midterm project.pdf music retail and distribution
hershtara1
 
Controlling Financial Processes at a Municipality
Controlling Financial Processes at a MunicipalityControlling Financial Processes at a Municipality
Controlling Financial Processes at a Municipality
Process mining Evangelist
 
national income & related aggregates (1)(1).pptx
national income & related aggregates (1)(1).pptxnational income & related aggregates (1)(1).pptx
national income & related aggregates (1)(1).pptx
j2492618
 
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
bastakwyry
 
Feature Engineering for Electronic Health Record Systems
Feature Engineering for Electronic Health Record SystemsFeature Engineering for Electronic Health Record Systems
Feature Engineering for Electronic Health Record Systems
Process mining Evangelist
 
Z14_IBM__APL_by_Christian_Demmer_IBM.pdf
Z14_IBM__APL_by_Christian_Demmer_IBM.pdfZ14_IBM__APL_by_Christian_Demmer_IBM.pdf
Z14_IBM__APL_by_Christian_Demmer_IBM.pdf
Fariborz Seyedloo
 
Oral Malodor.pptx jsjshdhushehsidjjeiejdhfj
Oral Malodor.pptx jsjshdhushehsidjjeiejdhfjOral Malodor.pptx jsjshdhushehsidjjeiejdhfj
Oral Malodor.pptx jsjshdhushehsidjjeiejdhfj
maitripatel5301
 
problem solving.presentation slideshow bsc nursing
problem solving.presentation slideshow bsc nursingproblem solving.presentation slideshow bsc nursing
problem solving.presentation slideshow bsc nursing
vishnudathas123
 
Introduction to systems thinking tools_Eng.pdf
Introduction to systems thinking tools_Eng.pdfIntroduction to systems thinking tools_Eng.pdf
Introduction to systems thinking tools_Eng.pdf
AbdurahmanAbd
 
Lesson 6-Interviewing in SHRM_updated.pdf
Lesson 6-Interviewing in SHRM_updated.pdfLesson 6-Interviewing in SHRM_updated.pdf
Lesson 6-Interviewing in SHRM_updated.pdf
hemelali11
 
How to Set Up Process Mining in a Decentralized Organization?
How to Set Up Process Mining in a Decentralized Organization?How to Set Up Process Mining in a Decentralized Organization?
How to Set Up Process Mining in a Decentralized Organization?
Process mining Evangelist
 
Fundamentals of Data Analysis, its types, tools, algorithms
Fundamentals of Data Analysis, its types, tools, algorithmsFundamentals of Data Analysis, its types, tools, algorithms
Fundamentals of Data Analysis, its types, tools, algorithms
priyaiyerkbcsc
 
Language Learning App Data Research by Globibo [2025]
Language Learning App Data Research by Globibo [2025]Language Learning App Data Research by Globibo [2025]
Language Learning App Data Research by Globibo [2025]
globibo
 
CS-404 COA COURSE FILE JAN JUN 2025.docx
CS-404 COA COURSE FILE JAN JUN 2025.docxCS-404 COA COURSE FILE JAN JUN 2025.docx
CS-404 COA COURSE FILE JAN JUN 2025.docx
nidarizvitit
 
Day 1 MS Excel Basics #.pptxDay 1 MS Excel Basics #.pptxDay 1 MS Excel Basics...
Day 1 MS Excel Basics #.pptxDay 1 MS Excel Basics #.pptxDay 1 MS Excel Basics...Day 1 MS Excel Basics #.pptxDay 1 MS Excel Basics #.pptxDay 1 MS Excel Basics...
Day 1 MS Excel Basics #.pptxDay 1 MS Excel Basics #.pptxDay 1 MS Excel Basics...
Jayantilal Bhanushali
 
Dynamics 365 Business Rules Dynamics Dynamics
Dynamics 365 Business Rules Dynamics DynamicsDynamics 365 Business Rules Dynamics Dynamics
Dynamics 365 Business Rules Dynamics Dynamics
heyoubro69
 
RAG Chatbot using AWS Bedrock and Streamlit Framework
RAG Chatbot using AWS Bedrock and Streamlit FrameworkRAG Chatbot using AWS Bedrock and Streamlit Framework
RAG Chatbot using AWS Bedrock and Streamlit Framework
apanneer
 
Time series for yotube_1_data anlysis.pdf
Time series for yotube_1_data anlysis.pdfTime series for yotube_1_data anlysis.pdf
Time series for yotube_1_data anlysis.pdf
asmaamahmoudsaeed
 
Process Mining as Enabler for Digital Transformations
Process Mining as Enabler for Digital TransformationsProcess Mining as Enabler for Digital Transformations
Process Mining as Enabler for Digital Transformations
Process mining Evangelist
 
hersh's midterm project.pdf music retail and distribution
hersh's midterm project.pdf music retail and distributionhersh's midterm project.pdf music retail and distribution
hersh's midterm project.pdf music retail and distribution
hershtara1
 
Controlling Financial Processes at a Municipality
Controlling Financial Processes at a MunicipalityControlling Financial Processes at a Municipality
Controlling Financial Processes at a Municipality
Process mining Evangelist
 
national income & related aggregates (1)(1).pptx
national income & related aggregates (1)(1).pptxnational income & related aggregates (1)(1).pptx
national income & related aggregates (1)(1).pptx
j2492618
 
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
2-Raction quotient_١٠٠١٤٦.ppt of physical chemisstry
bastakwyry
 
Feature Engineering for Electronic Health Record Systems
Feature Engineering for Electronic Health Record SystemsFeature Engineering for Electronic Health Record Systems
Feature Engineering for Electronic Health Record Systems
Process mining Evangelist
 
Z14_IBM__APL_by_Christian_Demmer_IBM.pdf
Z14_IBM__APL_by_Christian_Demmer_IBM.pdfZ14_IBM__APL_by_Christian_Demmer_IBM.pdf
Z14_IBM__APL_by_Christian_Demmer_IBM.pdf
Fariborz Seyedloo
 
Oral Malodor.pptx jsjshdhushehsidjjeiejdhfj
Oral Malodor.pptx jsjshdhushehsidjjeiejdhfjOral Malodor.pptx jsjshdhushehsidjjeiejdhfj
Oral Malodor.pptx jsjshdhushehsidjjeiejdhfj
maitripatel5301
 
problem solving.presentation slideshow bsc nursing
problem solving.presentation slideshow bsc nursingproblem solving.presentation slideshow bsc nursing
problem solving.presentation slideshow bsc nursing
vishnudathas123
 
Introduction to systems thinking tools_Eng.pdf
Introduction to systems thinking tools_Eng.pdfIntroduction to systems thinking tools_Eng.pdf
Introduction to systems thinking tools_Eng.pdf
AbdurahmanAbd
 
Lesson 6-Interviewing in SHRM_updated.pdf
Lesson 6-Interviewing in SHRM_updated.pdfLesson 6-Interviewing in SHRM_updated.pdf
Lesson 6-Interviewing in SHRM_updated.pdf
hemelali11
 
How to Set Up Process Mining in a Decentralized Organization?
How to Set Up Process Mining in a Decentralized Organization?How to Set Up Process Mining in a Decentralized Organization?
How to Set Up Process Mining in a Decentralized Organization?
Process mining Evangelist
 
Fundamentals of Data Analysis, its types, tools, algorithms
Fundamentals of Data Analysis, its types, tools, algorithmsFundamentals of Data Analysis, its types, tools, algorithms
Fundamentals of Data Analysis, its types, tools, algorithms
priyaiyerkbcsc
 
Language Learning App Data Research by Globibo [2025]
Language Learning App Data Research by Globibo [2025]Language Learning App Data Research by Globibo [2025]
Language Learning App Data Research by Globibo [2025]
globibo
 
CS-404 COA COURSE FILE JAN JUN 2025.docx
CS-404 COA COURSE FILE JAN JUN 2025.docxCS-404 COA COURSE FILE JAN JUN 2025.docx
CS-404 COA COURSE FILE JAN JUN 2025.docx
nidarizvitit
 
Day 1 MS Excel Basics #.pptxDay 1 MS Excel Basics #.pptxDay 1 MS Excel Basics...
Day 1 MS Excel Basics #.pptxDay 1 MS Excel Basics #.pptxDay 1 MS Excel Basics...Day 1 MS Excel Basics #.pptxDay 1 MS Excel Basics #.pptxDay 1 MS Excel Basics...
Day 1 MS Excel Basics #.pptxDay 1 MS Excel Basics #.pptxDay 1 MS Excel Basics...
Jayantilal Bhanushali
 
Dynamics 365 Business Rules Dynamics Dynamics
Dynamics 365 Business Rules Dynamics DynamicsDynamics 365 Business Rules Dynamics Dynamics
Dynamics 365 Business Rules Dynamics Dynamics
heyoubro69
 

How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Thanh Tran

  • 1. Data Science Infrastructure Team, Thanh Tran Upwork How to Rebuild Data and ML Platform using Kinesis, S3, Spark, MLlib, Databricks, Airflow and Upwork #AssignedHashtagGoesHere
  • 3. 3 WE I OUR Nikolay Melnik Lead ML Engineer Ukraine Dimitris Manikis Senior Data Engineer Greece Artem Moskvin Data/ML Engineer Germany Roman Tkachuk Senior Data Engineer Ukraine Andrei Demus Data/ML Engineer Ukraine Igor Korsunov ML Engineer Russia Anna Lysak Data/ML Engineer Ukraine Yongtao Ma Senior ML Engineer Germany Giannis Koutsoubos Lead Backend Engineer Greece
  • 4. 4 ● Highest-skilled experts for the job ● Competitive/lower rate ● Mix of long-term and project-based staff QUALITY COST/EARNING AGILITY ● Work on cutting-edge projects ● Happy with competitive compensation + flexibility in location and work hours ● Work only when they want work My Team With Upwork, our new hires AND I are better off! Me
  • 5. 5 We believe significant welfare improvements can be achieved through data science driven optimization of the online labor marketplace.
  • 6. 6 We have the biggest closed-loop online dataset of jobs and job seekers in labor history. Contract progress (~1B) Feedback (~10M) Web site activity (~10B) Money transactions (~100M) Profiles (~10M) Job Posts (~10M) Proposals (~100M) Messages (~100M) Hiring decisions (~10M) Contract progress (~1B) Feedback (~10M) Web site activity (~10B) Money transactions (~100M)
  • 7. 7 What do we need to ship data science products?
  • 8. 8 We need to support an agile data science workflow to provide quick and validated improvements! ● Data Science analytics ○ Complete and cleansed data, single ground-truth ○ Tools for computing metrics, continuous validation ● Data Science model development ○ Business objects and UI event data ○ Scaling complex data processing and feature computation ○ Discoverability of data and features ○ Batch + live data mismatches ○ Managing, monitoring and versioning of models and experiments ○ Knowledge sharing and code reuse (experiments, model, feature computation pipeline) ○ Flexibility to accommodate variety of ML frameworks ● Data Science model productionalization ○ Minimize differences between trained model and production code ○ Code modularized, tested, integrated into CI/CD workflow ○ Standardized model serving that is scalable, available, high throughput, low latency...
  • 9. Upwork Data & ML Platform 9
  • 10. 10
  • 11. 11 ● Kinesis and Structure Spark Streaming for high throughput live event data processing ● Moving away from traditional DWH solution to distributed Spark-based batch data processing to avoid performance issues and workload limitations ● Spark MLlib + Tensorflow as core ML libraries to balance the tradeoff between flexibility and standardized model engineering ● Data processing, feature computation and pipeline retraining jobs scheduled and orchestrated via Airflow ● Experiment management and model versioning integral part of CI/CD workflow ● Adopt engineer CI/CD workflow to data science using Jenkins, Databricks and Airflow: standalone model testing + live regression test helps to identify batch and live data mismatches ● Spark-based pipeline developed by data scientists directly used for model scoring in production environment ● Microservices for streamlined model serving, scalability, availability... ● Extensive use of Databricks notebook-based documentation of model, experiments and feature engineering code ● Graphite, ELK and Pagerduty for logging, monitoring and alerts
  • 12. Batch Data & ML Environment 12
  • 13. 13
  • 14. 14
  • 15. Live Data & ML Pipeline 15
  • 16. 16
  • 17. 17
  • 19. 19
  • 20. 20
  • 21. Pitfalls and Lessons Learned 21
  • 22. • Microservices can lead to data fragmentation and high downstream processing overhead • Structured streaming latency when number of Kinesis consumers is high • Stream-to-stream/stream-to-batch join not suitable for real-time use cases yet • Differences between live and batch data • Differences between trained vs. deployed ML pipeline can be minimized • CI/CD needs to be customized to support data science workflow and artefacts • Databricks notebooks very convenient for collaboration, documentation, code sharing and reuse, results dissemination 22
  • 23. 23 Interested in search & recommendations, multi-sided matching or online labor marketplace optimization? We are hiring! Interested in doing work only when you want work? Join Upwork as contractors!
  翻译: