Practical Data Science
Implementation on AWS
Ding Li 2021.8
1. Analyze Datasets and Train ML Models using AutoML
Data Science and Cloud
Register Data with AWS Glue and Query Data with Athena
Data Visualization
Statistical Bias and SageMaker Clarify
Covariate Drift: the distribution of the independent variables (the features) can change.
Prior Probability Drift: the distribution of the labels (the target variables) can change.
Concept Drift: the relationship between the features and the labels can change. Concept drift, also called concept shift, can happen when the definition of the label itself changes based on a particular feature, such as age or geographical location.
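As a rough illustration of prior probability drift, the sketch below compares the label distribution of a reference dataset with that of newer data. The function names, the sample labels, and the choice of total variation distance as the measure are assumptions made for illustration; the slides do not prescribe a particular drift statistic.

```python
from collections import Counter


def label_distribution(labels):
    """Return the relative frequency of each label value."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation_distance(reference_labels, current_labels):
    """Half the L1 distance between two label distributions.

    0.0 means the distributions are identical, 1.0 means they do not overlap.
    """
    p = label_distribution(reference_labels)
    q = label_distribution(current_labels)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


# Hypothetical sentiment labels (-1, 0, 1) at training time and later on.
train_labels = [1, 1, 1, 0, 0, -1, -1, 1]
recent_labels = [1, 0, 0, 0, -1, -1, -1, -1]

drift = total_variation_distance(train_labels, recent_labels)
print(f"label drift (total variation distance): {drift:.3f}")
```

The same comparison applied to a feature column instead of the label column would give a rough view of covariate drift.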
Measure
Class Imbalance (CI)
• Measures the imbalance in the number of examples that are provided for different facet values.
• Does a particular product category have a disproportionately larger number of total reviews than any other category in the dataset?
Difference in Proportions of Labels (DPL)
• Measures the imbalance of positive outcomes between the different facet values.
• Does a particular product category have disproportionately higher ratings than other categories?
Amazon SageMaker Clarify
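The two metrics can be sketched in plain Python. This is a minimal illustration, not SageMaker Clarify itself: it assumes the usual definitions CI = (n_a - n_d) / (n_a + n_d) and DPL = q_a - q_d, where d is the facet value of interest, a is every other value, n is the number of examples, and q is the proportion of positive labels. The product categories and labels are made up.

```python
def class_imbalance(facets, facet_d):
    """CI = (n_a - n_d) / (n_a + n_d); ranges from -1 to 1, 0 means balanced."""
    n_d = sum(1 for f in facets if f == facet_d)
    n_a = len(facets) - n_d
    return (n_a - n_d) / (n_a + n_d)


def difference_in_proportions_of_labels(facets, labels, facet_d, positive_label=1):
    """DPL = q_a - q_d, the gap in positive-label rate between the two groups.

    Assumes both groups contain at least one example.
    """
    n_d = sum(1 for f in facets if f == facet_d)
    n_a = len(facets) - n_d
    pos_d = sum(1 for f, y in zip(facets, labels) if f == facet_d and y == positive_label)
    pos_a = sum(1 for f, y in zip(facets, labels) if f != facet_d and y == positive_label)
    return pos_a / n_a - pos_d / n_d


# Hypothetical review data: facet = product category, label 1 = positive review.
categories = ["Dresses"] * 6 + ["Shoes"] * 2
sentiment = [1, 1, 1, 0, 1, 0, 1, 1]

ci = class_imbalance(categories, facet_d="Shoes")
dpl = difference_in_proportions_of_labels(categories, sentiment, facet_d="Shoes")
print(f"CI = {ci:.2f}, DPL = {dpl:.2f}")
```

Here a positive CI says "Shoes" is under-represented, and a negative DPL says "Shoes" receives a higher share of positive labels than the other categories.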
Feature Importance SHAP
Rank the individual features in the order of their importance and
contribution to the final model.
SHAP (SHapley Additive exPlanations)
A game-theoretic approach to explaining the output of any machine
learning model. It connects optimal credit allocation with local
explanations using the classic Shapley values from game theory and
their related extensions.
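To make the credit-allocation idea concrete, the sketch below computes exact Shapley values for a tiny, made-up payoff function by averaging each feature's marginal contribution over all feature orderings. The feature names and the `toy_value` function are hypothetical, and this brute-force enumeration is only practical for a handful of features; the SHAP library relies on far more efficient estimators.

```python
from itertools import permutations


def shapley_values(value_fn, features):
    """Exact Shapley values by enumerating every ordering of the features."""
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        coalition = set()
        for f in order:
            before = value_fn(frozenset(coalition))
            coalition.add(f)
            after = value_fn(frozenset(coalition))
            totals[f] += after - before  # marginal contribution of f in this ordering
    return {f: t / len(orderings) for f, t in totals.items()}


def toy_value(coalition):
    """Hypothetical model score when only the features in `coalition` are available."""
    score = 0.0
    if "review_text" in coalition:
        score += 0.6
    if "rating_count" in coalition:
        score += 0.2
    if "review_text" in coalition and "product_category" in coalition:
        score += 0.1  # interaction: the category only helps together with the text
    return score


phi = shapley_values(toy_value, ["review_text", "rating_count", "product_category"])
for name, value in sorted(phi.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.3f}")
```

The interaction term is split evenly between the two features that produce it, and the values sum to the score of the full feature set, which is the additivity property the SHAP name refers to.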
New Data Flow
Import Data
Add Data Analysis
Feature Importance
• AutoML lets experts focus on the hard problems that AutoML cannot solve.
• AutoML reduces repetitive work, so experts can apply their domain knowledge to analyzing the results.
Automatic data pre-processing and feature engineering
• Automatic data pre-processing and feature engineering: fills in missing data, provides statistical insights about the columns in your dataset, and extracts information from non-numeric columns, such as date and time information from timestamps.
• Automatic ML model selection: infers the type of prediction that best suits your data, such as binary classification, multi-class classification, or regression. SageMaker Autopilot then explores high-performing algorithms such as gradient boosting decision trees, feedforward deep neural networks, and logistic regression, and trains and optimizes hundreds of models based on these algorithms to find the model that best fits your data.
• Model leaderboard: lets you view the list of models ranked by metrics such as accuracy, precision, recall, and area under the curve (AUC), review model details such as the impact of features on predictions, and deploy the model that is best suited to your use case.
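The model-selection and leaderboard ideas can be sketched with a toy heuristic. Everything here is hypothetical: the rule for inferring the problem type, the `max_classes` threshold, the candidate names, and the scores are illustrative assumptions, and the slides do not describe how Autopilot actually makes these decisions.

```python
def infer_problem_type(labels, max_classes=20):
    """Guess the prediction type from the label column (illustrative heuristic only)."""
    distinct = set(labels)
    if len(distinct) == 2:
        return "BinaryClassification"
    all_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in distinct
    )
    if all_numeric and len(distinct) > max_classes:
        return "Regression"
    return "MulticlassClassification"


def leaderboard(candidates, metric):
    """Rank candidate models by the chosen metric, best first."""
    return sorted(candidates, key=lambda c: c[metric], reverse=True)


# Hypothetical candidates produced by an AutoML run.
candidates = [
    {"name": "mlp-1", "accuracy": 0.78},
    {"name": "xgboost-1", "accuracy": 0.81},
    {"name": "linear-1", "accuracy": 0.74},
]

problem_type = infer_problem_type([-1, 0, 1, 1, 0, -1])
ranked = leaderboard(candidates, metric="accuracy")
print(problem_type, [c["name"] for c in ranked])
```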
Amazon SageMaker Built-in Algorithms
Explore the Use Case and Analyze the Dataset:
• AWS Data Wrangler
• AWS Glue
• Amazon Athena
• Matplotlib
• Seaborn
• Pandas
• NumPy
Data Bias and Feature Importance:
• Measure Pretraining Bias - Amazon SageMaker
• SHAP
Automated Machine Learning:
• Amazon SageMaker Autopilot
Built-in algorithms:
• Elastic Machine Learning Algorithms in Amazon SageMaker
• Word2Vec algorithm
• GloVe algorithm
• FastText algorithm
• Transformer architecture, "Attention Is All You Need"
• BlazingText algorithm
• ELMo algorithm
• GPT model architecture
• BERT model architecture
• Built-in algorithms
• Amazon SageMaker BlazingText
2. Build, Train, and Deploy ML Pipelines using BERT
• Dataset best fits the algorithm
• Improve ML model performance
Feature Engineering Steps
Feature Engineering Pipeline
Split Dataset
Feature Engineering
BERT Embedding
SageMaker Processing with scikit-learn
Parameters: code, processingInput, processingOutput
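The `code` parameter points at a script that the processing job runs. The sketch below shows what the dataset-splitting part of such a script might look like in plain Python. The 90/5/5 split ratios, the file names, and the `/opt/ml/processing/...` default paths are assumptions for illustration; in a real job the paths would need to match whatever the processing input and output declare.

```python
import csv
import os
import random


def split_rows(rows, train_frac=0.90, validation_frac=0.05, seed=42):
    """Shuffle rows deterministically and split into train/validation/test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * validation_frac)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


def write_csv(path, header, rows):
    """Write a header and rows to `path`, creating parent directories as needed."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


def main(input_path="/opt/ml/processing/input/data/reviews.csv",
         output_dir="/opt/ml/processing/output"):
    """Read one CSV, split it, and write one file per split (paths are assumed)."""
    with open(input_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)
    train, validation, test = split_rows(rows)
    for name, part in (("train", train), ("validation", validation), ("test", test)):
        write_csv(os.path.join(output_dir, name, f"{name}.csv"), header, part)

# A real processing script would call main() as its entry point.
```

Writing each split to its own directory lets each one be exposed as a separate output of the job, so later steps can consume the splits independently.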
Feature Store – Reuse the feature engineering results
Centralized, Reusable, Discoverable
Artifact
• The output of a step or task that can be consumed by the next
step in a pipeline or deployed directly for consumption.
SageMaker Pipelines
Feature Engineering and Feature Store:
• RoBERTa: A Robustly Optimized BERT Pretraining Approach
• Fundamental Techniques of Feature Engineering for Machine Learning
Train, Debug, and Profile a Machine Learning Model:
• PyTorch Hub
• TensorFlow Hub
• Hugging Face open-source NLP transformers library
• RoBERTa model
• Amazon SageMaker Model Training (Developer Guide)
• Amazon SageMaker Debugger: A system for real-time insights into machine learning model training
• The science behind SageMaker’s cost-saving Debugger
• Amazon SageMaker Debugger (Developer Guide)
• Amazon SageMaker Debugger (GitHub)
Deploy End-To-End Machine Learning Pipelines:
• A Chat with Andrew on MLOps: From Model-centric to Data-centric AI
3. Optimize ML Models and Deploy Human-in-the-Loop Pipelines
Advanced model training, tuning, and evaluation:
• Hyperband
• Bayesian Optimization
• Amazon SageMaker Automatic Model Tuning
Advanced model deployment and monitoring:
• A/B Testing
• Autoscaling
• Multi-armed bandit
• Batch Transform
• Inference Pipeline
• Model Monitor
Data labeling and human-in-the-loop pipelines:
• Towards Automated Data Quality Management for Machine Learning
• Amazon SageMaker Ground Truth Developer Guide
• Create high-quality instructions for Amazon SageMaker Ground Truth labeling jobs
• Amazon SageMaker Augmented AI (Amazon A2I) Developer Guide
Ad

More Related Content

What's hot (20)

AWS VS AZURE VS GCP.pptx
AWS VS AZURE VS GCP.pptxAWS VS AZURE VS GCP.pptx
AWS VS AZURE VS GCP.pptx
Raneesh Ramesan
 
AWS reInvent 2023 re:Cap services Slide deck
AWS reInvent 2023 re:Cap services Slide deckAWS reInvent 2023 re:Cap services Slide deck
AWS reInvent 2023 re:Cap services Slide deck
Sammy Cheung
 
MLOps and Reproducible ML on AWS with Kubeflow and SageMaker
MLOps and Reproducible ML on AWS with Kubeflow and SageMakerMLOps and Reproducible ML on AWS with Kubeflow and SageMaker
MLOps and Reproducible ML on AWS with Kubeflow and SageMaker
Provectus
 
Data mining
Data miningData mining
Data mining
Kinza Razzaq
 
Azure Data Engineering.pptx
Azure Data Engineering.pptxAzure Data Engineering.pptx
Azure Data Engineering.pptx
priyadharshini626440
 
Vector database
Vector databaseVector database
Vector database
Guy Korland
 
Black Friday Shopping Prediction_ PPT
Black Friday Shopping Prediction_ PPTBlack Friday Shopping Prediction_ PPT
Black Friday Shopping Prediction_ PPT
ArjunThumbayil
 
Suresh Poopandi_Generative AI On AWS-MidWestCommunityDay-Final.pdf
Suresh Poopandi_Generative AI On AWS-MidWestCommunityDay-Final.pdfSuresh Poopandi_Generative AI On AWS-MidWestCommunityDay-Final.pdf
Suresh Poopandi_Generative AI On AWS-MidWestCommunityDay-Final.pdf
AWS Chicago
 
Enterprise Data Lake
Enterprise Data LakeEnterprise Data Lake
Enterprise Data Lake
sambiswal
 
Introducing Neo4j
Introducing Neo4jIntroducing Neo4j
Introducing Neo4j
Neo4j
 
Big Data & Analytics Architecture
Big Data & Analytics ArchitectureBig Data & Analytics Architecture
Big Data & Analytics Architecture
Arvind Sathi
 
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...
Databricks
 
Modernize & Automate Analytics Data Pipelines
Modernize & Automate Analytics Data PipelinesModernize & Automate Analytics Data Pipelines
Modernize & Automate Analytics Data Pipelines
Carole Gunst
 
MLOps by Sasha Rosenbaum
MLOps by Sasha RosenbaumMLOps by Sasha Rosenbaum
MLOps by Sasha Rosenbaum
Sasha Rosenbaum
 
Real time analytics
Real time analyticsReal time analytics
Real time analytics
Leandro Totino Pereira
 
Amazon SageMaker for MLOps Presentation.
Amazon SageMaker for MLOps Presentation.Amazon SageMaker for MLOps Presentation.
Amazon SageMaker for MLOps Presentation.
Knoldus Inc.
 
How to Start Your Application Modernization Journey
How to Start Your Application Modernization JourneyHow to Start Your Application Modernization Journey
How to Start Your Application Modernization Journey
VMware Tanzu
 
Data visualization
Data visualizationData visualization
Data visualization
Jan Willem Tulp
 
Machine Learning Pitch Deck
Machine Learning Pitch DeckMachine Learning Pitch Deck
Machine Learning Pitch Deck
Nicholas Vossburg
 
stackconf 2022: Introduction to Vector Search with Weaviate
stackconf 2022: Introduction to Vector Search with Weaviatestackconf 2022: Introduction to Vector Search with Weaviate
stackconf 2022: Introduction to Vector Search with Weaviate
NETWAYS
 
AWS VS AZURE VS GCP.pptx
AWS VS AZURE VS GCP.pptxAWS VS AZURE VS GCP.pptx
AWS VS AZURE VS GCP.pptx
Raneesh Ramesan
 
AWS reInvent 2023 re:Cap services Slide deck
AWS reInvent 2023 re:Cap services Slide deckAWS reInvent 2023 re:Cap services Slide deck
AWS reInvent 2023 re:Cap services Slide deck
Sammy Cheung
 
MLOps and Reproducible ML on AWS with Kubeflow and SageMaker
MLOps and Reproducible ML on AWS with Kubeflow and SageMakerMLOps and Reproducible ML on AWS with Kubeflow and SageMaker
MLOps and Reproducible ML on AWS with Kubeflow and SageMaker
Provectus
 
Black Friday Shopping Prediction_ PPT
Black Friday Shopping Prediction_ PPTBlack Friday Shopping Prediction_ PPT
Black Friday Shopping Prediction_ PPT
ArjunThumbayil
 
Suresh Poopandi_Generative AI On AWS-MidWestCommunityDay-Final.pdf
Suresh Poopandi_Generative AI On AWS-MidWestCommunityDay-Final.pdfSuresh Poopandi_Generative AI On AWS-MidWestCommunityDay-Final.pdf
Suresh Poopandi_Generative AI On AWS-MidWestCommunityDay-Final.pdf
AWS Chicago
 
Enterprise Data Lake
Enterprise Data LakeEnterprise Data Lake
Enterprise Data Lake
sambiswal
 
Introducing Neo4j
Introducing Neo4jIntroducing Neo4j
Introducing Neo4j
Neo4j
 
Big Data & Analytics Architecture
Big Data & Analytics ArchitectureBig Data & Analytics Architecture
Big Data & Analytics Architecture
Arvind Sathi
 
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...
Real-Time Fraud Detection at Scale—Integrating Real-Time Deep-Link Graph Anal...
Databricks
 
Modernize & Automate Analytics Data Pipelines
Modernize & Automate Analytics Data PipelinesModernize & Automate Analytics Data Pipelines
Modernize & Automate Analytics Data Pipelines
Carole Gunst
 
MLOps by Sasha Rosenbaum
MLOps by Sasha RosenbaumMLOps by Sasha Rosenbaum
MLOps by Sasha Rosenbaum
Sasha Rosenbaum
 
Amazon SageMaker for MLOps Presentation.
Amazon SageMaker for MLOps Presentation.Amazon SageMaker for MLOps Presentation.
Amazon SageMaker for MLOps Presentation.
Knoldus Inc.
 
How to Start Your Application Modernization Journey
How to Start Your Application Modernization JourneyHow to Start Your Application Modernization Journey
How to Start Your Application Modernization Journey
VMware Tanzu
 
stackconf 2022: Introduction to Vector Search with Weaviate
stackconf 2022: Introduction to Vector Search with Weaviatestackconf 2022: Introduction to Vector Search with Weaviate
stackconf 2022: Introduction to Vector Search with Weaviate
NETWAYS
 

Similar to Practical data science (20)

Machine Learning and AI at Oracle
Machine Learning and AI at OracleMachine Learning and AI at Oracle
Machine Learning and AI at Oracle
Sandesh Rao
 
Augmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML ToolkitAugmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML Toolkit
Databricks
 
The Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it WorkThe Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it Work
Ivo Andreev
 
Machine learning
Machine learningMachine learning
Machine learning
Saravanan Subburayal
 
Python for Machine Learning_ A Comprehensive Overview.pptx
Python for Machine Learning_ A Comprehensive Overview.pptxPython for Machine Learning_ A Comprehensive Overview.pptx
Python for Machine Learning_ A Comprehensive Overview.pptx
KuldeepSinghBrar3
 
AutoML - Heralding a New Era of Machine Learning - CASOUG Oct 2021
AutoML - Heralding a New Era of Machine Learning - CASOUG Oct 2021AutoML - Heralding a New Era of Machine Learning - CASOUG Oct 2021
AutoML - Heralding a New Era of Machine Learning - CASOUG Oct 2021
Sandesh Rao
 
Guiding through a typical Machine Learning Pipeline
Guiding through a typical Machine Learning PipelineGuiding through a typical Machine Learning Pipeline
Guiding through a typical Machine Learning Pipeline
Michael Gerke
 
Augmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML ToolkitAugmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML Toolkit
Databricks
 
MLIntro_ADA.pptx
MLIntro_ADA.pptxMLIntro_ADA.pptx
MLIntro_ADA.pptx
ADA Consulting
 
AI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in MichelangeloAI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in Michelangelo
Alluxio, Inc.
 
[DSC Europe 23] Petar Zecevic - ML in Production on Databricks
[DSC Europe 23] Petar Zecevic - ML in Production on Databricks[DSC Europe 23] Petar Zecevic - ML in Production on Databricks
[DSC Europe 23] Petar Zecevic - ML in Production on Databricks
DataScienceConferenc1
 
The Data Science Process - Do we need it and how to apply?
The Data Science Process - Do we need it and how to apply?The Data Science Process - Do we need it and how to apply?
The Data Science Process - Do we need it and how to apply?
Ivo Andreev
 
MLOPS By Amazon offered and free download
MLOPS By Amazon offered and free downloadMLOPS By Amazon offered and free download
MLOPS By Amazon offered and free download
pouyan533
 
Building machine learning inference pipelines at scale (March 2019)
Building machine learning inference pipelines at scale (March 2019)Building machine learning inference pipelines at scale (March 2019)
Building machine learning inference pipelines at scale (March 2019)
Julien SIMON
 
AlphaPy: A Data Science Pipeline in Python
AlphaPy: A Data Science Pipeline in PythonAlphaPy: A Data Science Pipeline in Python
AlphaPy: A Data Science Pipeline in Python
Mark Conway
 
AlphaPy
AlphaPyAlphaPy
AlphaPy
Robert Scott
 
.Net development with Azure Machine Learning (AzureML) Nov 2014
.Net development with Azure Machine Learning (AzureML) Nov 2014.Net development with Azure Machine Learning (AzureML) Nov 2014
.Net development with Azure Machine Learning (AzureML) Nov 2014
Mark Tabladillo
 
Unlocking DataDriven Talent Intelligence Transforming TALENTX with Industry P...
Unlocking DataDriven Talent Intelligence Transforming TALENTX with Industry P...Unlocking DataDriven Talent Intelligence Transforming TALENTX with Industry P...
Unlocking DataDriven Talent Intelligence Transforming TALENTX with Industry P...
Prasanna Hegde
 
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsApache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Anyscale
 
Machine Learning Models in Production
Machine Learning Models in ProductionMachine Learning Models in Production
Machine Learning Models in Production
DataWorks Summit
 
Machine Learning and AI at Oracle
Machine Learning and AI at OracleMachine Learning and AI at Oracle
Machine Learning and AI at Oracle
Sandesh Rao
 
Augmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML ToolkitAugmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML Toolkit
Databricks
 
The Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it WorkThe Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it Work
Ivo Andreev
 
Python for Machine Learning_ A Comprehensive Overview.pptx
Python for Machine Learning_ A Comprehensive Overview.pptxPython for Machine Learning_ A Comprehensive Overview.pptx
Python for Machine Learning_ A Comprehensive Overview.pptx
KuldeepSinghBrar3
 
AutoML - Heralding a New Era of Machine Learning - CASOUG Oct 2021
AutoML - Heralding a New Era of Machine Learning - CASOUG Oct 2021AutoML - Heralding a New Era of Machine Learning - CASOUG Oct 2021
AutoML - Heralding a New Era of Machine Learning - CASOUG Oct 2021
Sandesh Rao
 
Guiding through a typical Machine Learning Pipeline
Guiding through a typical Machine Learning PipelineGuiding through a typical Machine Learning Pipeline
Guiding through a typical Machine Learning Pipeline
Michael Gerke
 
Augmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML ToolkitAugmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML Toolkit
Databricks
 
AI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in MichelangeloAI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in Michelangelo
Alluxio, Inc.
 
[DSC Europe 23] Petar Zecevic - ML in Production on Databricks
[DSC Europe 23] Petar Zecevic - ML in Production on Databricks[DSC Europe 23] Petar Zecevic - ML in Production on Databricks
[DSC Europe 23] Petar Zecevic - ML in Production on Databricks
DataScienceConferenc1
 
The Data Science Process - Do we need it and how to apply?
The Data Science Process - Do we need it and how to apply?The Data Science Process - Do we need it and how to apply?
The Data Science Process - Do we need it and how to apply?
Ivo Andreev
 
MLOPS By Amazon offered and free download
MLOPS By Amazon offered and free downloadMLOPS By Amazon offered and free download
MLOPS By Amazon offered and free download
pouyan533
 
Building machine learning inference pipelines at scale (March 2019)
Building machine learning inference pipelines at scale (March 2019)Building machine learning inference pipelines at scale (March 2019)
Building machine learning inference pipelines at scale (March 2019)
Julien SIMON
 
AlphaPy: A Data Science Pipeline in Python
AlphaPy: A Data Science Pipeline in PythonAlphaPy: A Data Science Pipeline in Python
AlphaPy: A Data Science Pipeline in Python
Mark Conway
 
.Net development with Azure Machine Learning (AzureML) Nov 2014
.Net development with Azure Machine Learning (AzureML) Nov 2014.Net development with Azure Machine Learning (AzureML) Nov 2014
.Net development with Azure Machine Learning (AzureML) Nov 2014
Mark Tabladillo
 
Unlocking DataDriven Talent Intelligence Transforming TALENTX with Industry P...
Unlocking DataDriven Talent Intelligence Transforming TALENTX with Industry P...Unlocking DataDriven Talent Intelligence Transforming TALENTX with Industry P...
Unlocking DataDriven Talent Intelligence Transforming TALENTX with Industry P...
Prasanna Hegde
 
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsApache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Anyscale
 
Machine Learning Models in Production
Machine Learning Models in ProductionMachine Learning Models in Production
Machine Learning Models in Production
DataWorks Summit
 
Ad

More from Ding Li (13)

Software architecture for data applications
Software architecture for data applicationsSoftware architecture for data applications
Software architecture for data applications
Ding Li
 
Seismic data analysis with u net
Seismic data analysis with u netSeismic data analysis with u net
Seismic data analysis with u net
Ding Li
 
Titanic survivor prediction by machine learning
Titanic survivor prediction by machine learningTitanic survivor prediction by machine learning
Titanic survivor prediction by machine learning
Ding Li
 
Find nuclei in images with U-net
Find nuclei in images with U-netFind nuclei in images with U-net
Find nuclei in images with U-net
Ding Li
 
Digit recognizer by convolutional neural network
Digit recognizer by convolutional neural networkDigit recognizer by convolutional neural network
Digit recognizer by convolutional neural network
Ding Li
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learning
Ding Li
 
Recommendation system
Recommendation systemRecommendation system
Recommendation system
Ding Li
 
Generative adversarial networks
Generative adversarial networksGenerative adversarial networks
Generative adversarial networks
Ding Li
 
AI to advance science research
AI to advance science researchAI to advance science research
AI to advance science research
Ding Li
 
Machine learning with graph
Machine learning with graphMachine learning with graph
Machine learning with graph
Ding Li
 
Natural language processing and transformer models
Natural language processing and transformer modelsNatural language processing and transformer models
Natural language processing and transformer models
Ding Li
 
Great neck school budget 2016-2017 analysis
Great neck school budget 2016-2017 analysisGreat neck school budget 2016-2017 analysis
Great neck school budget 2016-2017 analysis
Ding Li
 
Business Intelligence and Big Data in Cloud
Business Intelligence and Big Data in CloudBusiness Intelligence and Big Data in Cloud
Business Intelligence and Big Data in Cloud
Ding Li
 
Software architecture for data applications
Software architecture for data applicationsSoftware architecture for data applications
Software architecture for data applications
Ding Li
 
Seismic data analysis with u net
Seismic data analysis with u netSeismic data analysis with u net
Seismic data analysis with u net
Ding Li
 
Titanic survivor prediction by machine learning
Titanic survivor prediction by machine learningTitanic survivor prediction by machine learning
Titanic survivor prediction by machine learning
Ding Li
 
Find nuclei in images with U-net
Find nuclei in images with U-netFind nuclei in images with U-net
Find nuclei in images with U-net
Ding Li
 
Digit recognizer by convolutional neural network
Digit recognizer by convolutional neural networkDigit recognizer by convolutional neural network
Digit recognizer by convolutional neural network
Ding Li
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learning
Ding Li
 
Recommendation system
Recommendation systemRecommendation system
Recommendation system
Ding Li
 
Generative adversarial networks
Generative adversarial networksGenerative adversarial networks
Generative adversarial networks
Ding Li
 
AI to advance science research
AI to advance science researchAI to advance science research
AI to advance science research
Ding Li
 
Machine learning with graph
Machine learning with graphMachine learning with graph
Machine learning with graph
Ding Li
 
Natural language processing and transformer models
Natural language processing and transformer modelsNatural language processing and transformer models
Natural language processing and transformer models
Ding Li
 
Great neck school budget 2016-2017 analysis
Great neck school budget 2016-2017 analysisGreat neck school budget 2016-2017 analysis
Great neck school budget 2016-2017 analysis
Ding Li
 
Business Intelligence and Big Data in Cloud
Business Intelligence and Big Data in CloudBusiness Intelligence and Big Data in Cloud
Business Intelligence and Big Data in Cloud
Ding Li
 
Ad

Recently uploaded (20)

AWS Certified Machine Learning Slides.pdf
AWS Certified Machine Learning Slides.pdfAWS Certified Machine Learning Slides.pdf
AWS Certified Machine Learning Slides.pdf
philsparkshome
 
How to Set Up Process Mining in a Decentralized Organization?
How to Set Up Process Mining in a Decentralized Organization?How to Set Up Process Mining in a Decentralized Organization?
How to Set Up Process Mining in a Decentralized Organization?
Process mining Evangelist
 
problem solving.presentation slideshow bsc nursing
problem solving.presentation slideshow bsc nursingproblem solving.presentation slideshow bsc nursing
problem solving.presentation slideshow bsc nursing
vishnudathas123
 
indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...
indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...
indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...
disnakertransjabarda
 
Publication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdf
Publication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdfPublication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdf
Publication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdf
StatsCommunications
 
CERTIFIED BUSINESS ANALYSIS PROFESSIONAL™
CERTIFIED BUSINESS ANALYSIS PROFESSIONAL™CERTIFIED BUSINESS ANALYSIS PROFESSIONAL™
CERTIFIED BUSINESS ANALYSIS PROFESSIONAL™
muhammed84essa
 
Feature Engineering for Electronic Health Record Systems
Feature Engineering for Electronic Health Record SystemsFeature Engineering for Electronic Health Record Systems
Feature Engineering for Electronic Health Record Systems
Process mining Evangelist
 
CS-404 COA COURSE FILE JAN JUN 2025.docx
CS-404 COA COURSE FILE JAN JUN 2025.docxCS-404 COA COURSE FILE JAN JUN 2025.docx
CS-404 COA COURSE FILE JAN JUN 2025.docx
nidarizvitit
 
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
录取通知书加拿大TMU毕业证多伦多都会大学电子版毕业证成绩单
Taqyea
 
Transforming health care with ai powered
Transforming health care with ai poweredTransforming health care with ai powered
Transforming health care with ai powered
gowthamarvj
 
AWS-Certified-ML-Engineer-Associate-Slides.pdf
AWS-Certified-ML-Engineer-Associate-Slides.pdfAWS-Certified-ML-Engineer-Associate-Slides.pdf
AWS-Certified-ML-Engineer-Associate-Slides.pdf
philsparkshome
 
Lagos School of Programming Final Project Updated.pdf
Lagos School of Programming Final Project Updated.pdfLagos School of Programming Final Project Updated.pdf
Lagos School of Programming Final Project Updated.pdf
benuju2016
 
Controlling Financial Processes at a Municipality
Controlling Financial Processes at a MunicipalityControlling Financial Processes at a Municipality
Controlling Financial Processes at a Municipality
Process mining Evangelist
 
文凭证书美国SDSU文凭圣地亚哥州立大学学生证学历认证查询
文凭证书美国SDSU文凭圣地亚哥州立大学学生证学历认证查询文凭证书美国SDSU文凭圣地亚哥州立大学学生证学历认证查询
文凭证书美国SDSU文凭圣地亚哥州立大学学生证学历认证查询
Taqyea
 
Automated Melanoma Detection via Image Processing.pptx
Automated Melanoma Detection via Image Processing.pptxAutomated Melanoma Detection via Image Processing.pptx
Automated Melanoma Detection via Image Processing.pptx
handrymaharjan23
 
hersh's midterm project.pdf music retail and distribution
hersh's midterm project.pdf music retail and distributionhersh's midterm project.pdf music retail and distribution
hersh's midterm project.pdf music retail and distribution
hershtara1
 
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdfTOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
TOAE201-Slides-Chapter 4. Sample theoretical basis (1).pdf
NhiV747372
 
L1_Slides_Foundational Concepts_508.pptx
L1_Slides_Foundational Concepts_508.pptxL1_Slides_Foundational Concepts_508.pptx
L1_Slides_Foundational Concepts_508.pptx
38NoopurPatel
 
2024 Digital Equity Accelerator Report.pdf
2024 Digital Equity Accelerator Report.pdf2024 Digital Equity Accelerator Report.pdf
2024 Digital Equity Accelerator Report.pdf
dominikamizerska1
 
Mining a Global Trade Process with Data Science - Microsoft
Mining a Global Trade Process with Data Science - MicrosoftMining a Global Trade Process with Data Science - Microsoft
Mining a Global Trade Process with Data Science - Microsoft
Process mining Evangelist
 
AWS Certified Machine Learning Slides.pdf
AWS Certified Machine Learning Slides.pdfAWS Certified Machine Learning Slides.pdf
AWS Certified Machine Learning Slides.pdf
philsparkshome
 
How to Set Up Process Mining in a Decentralized Organization?
How to Set Up Process Mining in a Decentralized Organization?How to Set Up Process Mining in a Decentralized Organization?
How to Set Up Process Mining in a Decentralized Organization?
Process mining Evangelist
 
problem solving.presentation slideshow bsc nursing
problem solving.presentation slideshow bsc nursingproblem solving.presentation slideshow bsc nursing
problem solving.presentation slideshow bsc nursing
vishnudathas123
 
indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...
indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...
indonesia-gen-z-report-2024 Gen Z (born between 1997 and 2012) is currently t...
disnakertransjabarda
 
Publication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdf
Publication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdfPublication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdf
Publication-launch-How-is-Life-for-Children-in-the-Digital-Age-15-May-2025.pdf
StatsCommunications
 
CERTIFIED BUSINESS ANALYSIS PROFESSIONAL™

Practical data science

  • 2. Part 1. Analyze Datasets and Train ML Models using AutoML
  • 4. Register Data with AWS Glue and Query Data with Athena
  • 6. Statistical Bias and SageMaker Clarify
      Types of drift:
      • Covariate drift: the distribution of the independent variables (the features) can change.
      • Prior probability drift: the distribution of the labels (the target variables) can change.
      • Concept drift (also called concept shift): the relationship between the features and the labels can change. It can happen when the definition of the label itself changes based on a particular feature, such as age or geographical location.
      Bias measures (Amazon SageMaker Clarify):
      • Class Imbalance (CI): measures the imbalance in the number of examples provided for different facet values. Example: does a particular product category have a disproportionately large number of total reviews compared with any other category in the dataset?
      • Difference in Proportions of Labels (DPL): measures the imbalance of positive outcomes between different facet values. Example: does a particular product category have disproportionately higher ratings than other categories?
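The two bias measures on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, not the SageMaker Clarify implementation: it assumes the commonly documented forms CI = (n_a - n_d) / (n_a + n_d) and DPL = q_a - q_d, where facet a is the chosen reference facet value, facet d is everything else, and q is the share of positive labels within a facet. The product categories and labels are made-up sample data, and both facets are assumed to be non-empty.

```python
def class_imbalance(facet_values, reference):
    """CI = (n_a - n_d) / (n_a + n_d); ranges from -1 to +1, 0 means balanced."""
    n_a = sum(1 for v in facet_values if v == reference)
    n_d = len(facet_values) - n_a
    return (n_a - n_d) / (n_a + n_d)


def difference_in_proportions_of_labels(facet_values, labels, reference, positive=1):
    """DPL = q_a - q_d, the gap in positive-label rates between the two facets."""
    labels_a = [y for v, y in zip(facet_values, labels) if v == reference]
    labels_d = [y for v, y in zip(facet_values, labels) if v != reference]
    q_a = sum(1 for y in labels_a if y == positive) / len(labels_a)
    q_d = sum(1 for y in labels_d if y == positive) / len(labels_d)
    return q_a - q_d


# Hypothetical sample: product category is the facet, 1 = positive review.
categories = ["Dresses"] * 6 + ["Shoes"] * 2
positive_review = [1, 1, 1, 1, 0, 0, 1, 0]

ci = class_imbalance(categories, reference="Dresses")
dpl = difference_in_proportions_of_labels(categories, positive_review, reference="Dresses")
print(f"CI  = {ci:.3f}")   # 6 vs 2 examples
print(f"DPL = {dpl:.3f}")  # 4/6 vs 1/2 positive
```

A CI near 0 means the facets are equally represented; a DPL near 0 means both facets receive positive labels at a similar rate.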
  • 7. Feature Importance with SHAP
      Rank the individual features in order of their importance and contribution to the final model.
      SHAP (SHapley Additive exPlanations): a game-theoretic approach to explaining the output of any machine learning model. It connects optimal credit allocation with local explanations using the classic Shapley values from game theory and their related extensions.
      Steps: New Data Flow, Import Data, Add Data Analysis, Feature Importance
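To make the "credit allocation" idea concrete, here is a small sketch that computes exact Shapley values by averaging each feature's marginal contribution over every ordering of the features. It does not use the SHAP library, which relies on far more efficient approximations. The toy model, the instance, and the all-zero baseline are assumptions chosen purely for illustration.

```python
from itertools import permutations


def shapley_values(players, value):
    """Average each player's marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}


# Hypothetical model: x3 is deliberately ignored, x1 and x2 interact.
def model(x1, x2, x3):
    return 2 * x1 + 3 * x2 + x1 * x2


instance = {"x1": 1, "x2": 1, "x3": 5}   # the prediction being explained
baseline = {"x1": 0, "x2": 0, "x3": 0}   # stand-in values for "absent" features


def value(coalition):
    """Model output when only the features in the coalition take their real values."""
    inputs = {k: (instance[k] if k in coalition else baseline[k]) for k in instance}
    return model(**inputs)


phi = shapley_values(list(instance), value)
for name, contribution in phi.items():
    print(name, contribution)
```

The contributions sum to the difference between the prediction for the instance and the prediction for the baseline, and the unused feature receives zero credit. Enumerating all orderings is only feasible for a handful of features.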
  • 8. AutoML
      • AutoML lets experts focus on the hard problems that AutoML cannot solve.
      • AutoML reduces repetitive work, so experts can apply their domain knowledge to analyzing the results.
  • 9. SageMaker Autopilot
      • Automatic data pre-processing and feature engineering: fills in missing data, provides statistical insights about the columns in your dataset, and extracts information from non-numeric columns, such as date and time information from timestamps.
      • Automatic ML model selection: infers the type of prediction that best suits your data, such as binary classification, multi-class classification, or regression. SageMaker Autopilot then explores high-performing algorithms such as gradient boosting decision trees, feedforward deep neural networks, and logistic regression, and trains and optimizes hundreds of models based on these algorithms to find the model that best fits your data.
      • Model leaderboard: view the list of models, ranked by metrics such as accuracy, precision, recall, and area under the curve (AUC); review model details such as the impact of features on predictions; and deploy the model best suited to your use case.
  • 11. References for Part 1
      Explore the Use Case and Analyze the Dataset:
      • AWS Data Wrangler
      • AWS Glue
      • Amazon Athena
      • Matplotlib
      • Seaborn
      • Pandas
      • Numpy
      Data Bias and Feature Importance:
      • Measure Pretraining Bias - Amazon SageMaker
      • SHAP
      Automated Machine Learning:
      • Amazon SageMaker Autopilot
      Built-in algorithms:
      • Elastic Machine Learning Algorithms in Amazon SageMaker
      • Word2Vec algorithm
      • GloVe algorithm
      • FastText algorithm
      • Transformer architecture, "Attention Is All You Need"
      • BlazingText algorithm
      • ELMo algorithm
      • GPT model architecture
      • BERT model architecture
      • Built-in algorithms
      • Amazon SageMaker BlazingText
  • 12. Part 2. Build, Train, and Deploy ML Pipelines using BERT
  • 13. Feature Engineering
      • Goals: make the dataset best fit the algorithm; improve ML model performance.
      • Feature Engineering Steps
      • Feature Engineering Pipeline: Split Dataset, Feature Engineering
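The "Split Dataset" step of the pipeline can be sketched with the standard library alone. The 90/5/5 train/validation/test fractions and the fixed seed below are assumptions for illustration, not values taken from the slides.

```python
import random


def split_dataset(rows, train_frac=0.90, validation_frac=0.05, seed=42):
    """Shuffle a copy of the rows and cut it into train, validation, and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = round(len(shuffled) * train_frac)
    n_validation = round(len(shuffled) * validation_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]  # whatever remains
    return train, validation, test


rows = list(range(100))  # stand-in for 100 dataset records
train, validation, test = split_dataset(rows)
print(len(train), len(validation), len(test))
```

Splitting before any feature engineering keeps statistics computed on the training set from leaking into the validation and test sets.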
  • 14. BERT Embedding
      • SageMaker Processing with scikit-learn
      • Parameters: code, processingInput, processingOutput
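As a rough picture of what the `code` parameter might point to, the sketch below is a processing script that reads CSV files from an input directory and writes transformed files to an output directory. Everything specific here is an assumption: the column names `review_body` and `star_rating`, the rating-to-sentiment mapping, and the default paths under `/opt/ml/processing`, which are the locations a processing input and output are conventionally mapped to inside the container. It does not call the SageMaker SDK.

```python
import argparse
import csv
import os


def star_rating_to_sentiment(star_rating):
    """Hypothetical labeling rule: 4-5 stars positive, 3 neutral, 1-2 negative."""
    if star_rating >= 4:
        return 1
    if star_rating == 3:
        return 0
    return -1


def process(input_dir, output_dir):
    """Transform every CSV in input_dir and write it to output_dir; return the row count."""
    os.makedirs(output_dir, exist_ok=True)
    rows_written = 0
    for name in sorted(os.listdir(input_dir)):
        if not name.endswith(".csv"):
            continue
        src_path = os.path.join(input_dir, name)
        dst_path = os.path.join(output_dir, name)
        with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
            writer = csv.DictWriter(dst, fieldnames=["review_body", "sentiment"])
            writer.writeheader()
            for row in csv.DictReader(src):
                sentiment = star_rating_to_sentiment(int(row["star_rating"]))
                writer.writerow({"review_body": row["review_body"], "sentiment": sentiment})
                rows_written += 1
    return rows_written


def parse_args(argv=None):
    """Directories default to the assumed container paths; override them for local runs."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--input-dir", default="/opt/ml/processing/input")
    parser.add_argument("--output-dir", default="/opt/ml/processing/output")
    return parser.parse_args(argv)
```

When run as a script, the entry point would call `parse_args()` and pass the two directories to `process`. Keeping the paths as arguments makes the same file runnable locally against a temporary directory.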
  • 15. Feature Store: reuse the feature engineering results
      • Centralized
      • Reusable
      • Discoverable
  • 22. SageMaker Pipelines
      • Artifact: the output of a step or task, which can be consumed by the next step in a pipeline or deployed directly for consumption.
  • 24. References for Part 2
      Feature Engineering and Feature Store:
      • RoBERTa: A Robustly Optimized BERT Pretraining Approach
      • Fundamental Techniques of Feature Engineering for Machine Learning
      Train, Debug, and Profile a Machine Learning Model:
      • PyTorch Hub
      • TensorFlow Hub
      • Hugging Face open-source NLP transformers library
      • RoBERTa model
      • Amazon SageMaker Model Training (Developer Guide)
      • Amazon SageMaker Debugger: A system for real-time insights into machine learning model training
      • The science behind SageMaker's cost-saving Debugger
      • Amazon SageMaker Debugger (Developer Guide)
      • Amazon SageMaker Debugger (GitHub)
      Deploy End-To-End Machine Learning Pipelines:
      • A Chat with Andrew on MLOps: From Model-centric to Data-centric AI
  • 25. Part 3. Optimize ML Models and Deploy Human-in-the-Loop Pipelines
  • 41. References for Part 3
      Advanced model training, tuning, and evaluation:
      • Hyperband
      • Bayesian Optimization
      • Amazon SageMaker Automatic Model Tuning
      Advanced model deployment and monitoring:
      • A/B Testing
      • Autoscaling
      • Multi-armed bandit
      • Batch Transform
      • Inference Pipeline
      • Model Monitor
      Data labeling and human-in-the-loop pipelines:
      • Towards Automated Data Quality Management for Machine Learning
      • Amazon SageMaker Ground Truth Developer Guide
      • Create high-quality instructions for Amazon SageMaker Ground Truth labeling jobs
      • Amazon SageMaker Augmented AI (Amazon A2I) Developer Guide