THE DATA MINING
PROCESS
LEARNING OBJECTIVES:
• PROVIDE AN OVERVIEW OF THE EIGHT STEPS IN THE DATA MINING
PROCESS.
• IDENTIFY THE ISSUES INVOLVED IN DEFINING A DATA MINING PROBLEM.
• DETERMINE WHEN TO USE AND NOT TO USE DATA MINING.
• EXPLAIN HOW TO CONDUCT AN EXPERIMENT TO DETERMINE WHETHER
MORE DATA IS NEEDED.
By Obakeng Brian Pheelwane & Marc Berman – Group 14
8 STEP DATA MINING PROCESS
• Defining the problem
• Collecting data
• Preparing data
• Pre-processing
• Selecting an algorithm and training parameters
• Training and testing
• Iterating to produce different models
• Evaluating the final model
- Defining the problem sets the objective of the whole data mining process.
- The next three steps involve collecting the required data, then preparing
and pre-processing it before a data mining technique
can be applied.
- Select an algorithm and perform training and testing to evaluate whether the model is good.
- Because one does not know in advance how an algorithm will perform on a data set, one
needs to try different models and compare them against each other.
- Because of the previous step, many iterations may be required across all steps in order to
determine the optimal model.
- The best model is selected based on the estimated accuracy. This model is then
used for future predictions.
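The compare-and-select idea in the last three points can be sketched in a few lines of Python. This is only an illustration: the candidate "models" are hand-written rules standing in for trained models, and the names `accuracy`, `select_best`, `candidates` and `test_set` are hypothetical, not part of the original slides.

```python
def accuracy(model, examples):
    """Fraction of (x, label) examples the model labels correctly."""
    return sum(model(x) == label for x, label in examples) / len(examples)

def select_best(candidates, test_set):
    """Score every candidate on the same held-out data and keep the best one."""
    scores = {name: accuracy(model, test_set) for name, model in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical candidates: simple rules standing in for trained models.
candidates = {
    "always_yes": lambda x: "yes",
    "threshold_5": lambda x: "yes" if x >= 5 else "no",
    "threshold_8": lambda x: "yes" if x >= 8 else "no",
}

# Made-up held-out examples used only to estimate accuracy.
test_set = [(1, "no"), (3, "no"), (4, "no"), (6, "yes"), (7, "yes"), (9, "yes")]

best, scores = select_best(candidates, test_set)
print(best, scores)
```

Every candidate is scored on the same held-out examples, so the estimated accuracies are directly comparable.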
8 STEP DATA MINING PROCESS
CONTINUED…
Defining the problem:
• One needs to determine which problems are suitable for data-driven modelling.
• How does one evaluate the results?
• Is it a classification or estimation problem?
• What are the inputs and outputs required for solving the problem?
When TO DO data mining:
• When no good existing solution exists and the problem has the following characteristics:
  - Lots of data
  - The problem is not well understood
  - The problem can be characterised as an input-to-output relationship
  - Existing models have strong and possibly erroneous assumptions
When NOT to do data mining:
• When the problem:
  - Has a complete, closed-form mathematical solution.
  - Is well understood and has a good analytical or rule-based solution.
8 STEP DATA MINING PROCESS
CONTINUED…
How do you evaluate the results?
• What level of accuracy would be considered successful?
• How will you benchmark the performance of a developed solution?
• What existing alternatives will you compare against?
• What kind of data will be used to evaluate the various models?
• What will the models be used for and how well do they support that purpose?
Classification or Estimation?
Discrete outputs = classification problem
Continuous outputs = estimation problem
Borderline outputs = can be either, depending on the granularity of the outputs.
8 STEP DATA MINING PROCESS
CLASSIFICATION VERSUS ESTIMATION
Classification:
In classification learning, the learning scheme is presented with a set of classified examples from which it is
expected to learn a way of classifying unseen examples.
In plain English, the output of classification is a discrete label. In the simplest case there are 2 values: yes or no,
1 or 0. Each example belongs to a single class, not both. The algorithm learns from the training data how to
predict the class of future unseen data.
Estimation:
In estimation, the output is not an absolute value (i.e. 0 or 1) but a real number, for example a number
between 0 and 1. So what if the value sits between two classes? For example, 0.5 could fall into
either class; however, if the classes are defined as 0 to 0.49 and 0.5 to 1, then it falls into the latter class.
When the model is first designed, a set of predefined classes is laid out in order to prevent a data point from
lying between two or more categories.
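As a minimal sketch of the boundary rule above, the function below maps a continuous estimate between 0 and 1 onto one of two predefined classes. The function name `to_class` and the class names "low" and "high" are assumptions made for illustration.

```python
def to_class(estimate, boundary=0.5):
    """Map a continuous estimate in [0, 1] to one of two predefined classes.

    Class "low" covers values below the boundary and class "high" covers the
    boundary and everything above it, so a value exactly on the boundary
    still has exactly one class.
    """
    if not 0.0 <= estimate <= 1.0:
        raise ValueError("estimate must lie between 0 and 1")
    return "high" if estimate >= boundary else "low"

print(to_class(0.49), to_class(0.5))
```

Using a half-open range for the lower class is what guarantees that no value can lie between two categories.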
WHAT ARE THE INPUTS AND OUTPUTS?
An example: the inputs and outputs when classifying loan applications at a bank.
Outputs:
The outputs could be “high risk” or “low risk”.
Inputs:
The inputs may include current salary, outstanding liabilities, bank account balances, other income and number
of years employed.
IMPORTANT ISSUES
Causal and Non-Causal attributes:
A causal attribute is one that causes another attribute, meaning it can be used to predict that attribute or
is included in its calculation. It is important to avoid using non-causal attributes, as they produce a model that is
not a representation of the future data.
Inputs:
The inputs must contain enough information to be able to generate the desired output. If the inputs contain
insufficient data the accuracy of the final model will inevitably decrease.
Data set:
The data chosen must be an accurate representation of the future data set. If it is not, predictions made on
future data will be inaccurate.
HOW MUCH DATA IS ENOUGH
Key: The amount of data required depends on the problem
complexity as well as the amount of noise in the data.
Each learning algorithm follows its own learning curve: the accuracy
increases as the data size increases, but once the algorithm
reaches its optimal performance, a further increase of the data set
CANNOT improve performance.
Figure 1:
Algorithm A reaches optimal performance sooner than Algorithm B;
however, the accuracy cannot increase after a certain point.
Note: One needs to experiment to determine when the algorithm
reaches its optimal performance. This can be determined by:
• Using small portions of the training set and gradually increasing the
size until the accuracy level stabilises.
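The experiment described in the note can be sketched as follows. Everything here is assumed for illustration: the synthetic one-feature data, the tiny threshold classifier, the training sizes and the stabilisation tolerance `tol` are not taken from the slides.

```python
import random

def make_data(n, noise=1.0, seed=0):
    """Synthetic two-class data: one feature, class means at -1 and +1 (assumed)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = rng.gauss(1.0 if label else -1.0, noise)
        data.append((x, label))
    return data

def train_threshold(train):
    """Fit a tiny classifier: a threshold halfway between the two class means."""
    means = {}
    for label in (0, 1):
        xs = [x for x, y in train if y == label]
        means[label] = sum(xs) / len(xs) if xs else 0.0
    return (means[0] + means[1]) / 2.0

def accuracy(threshold, test):
    """Fraction of test examples where 'x above threshold' matches the label."""
    return sum((x > threshold) == bool(y) for x, y in test) / len(test)

def learning_curve(train, test, sizes):
    """Train on the first n examples for each n and record test accuracy."""
    return [(n, accuracy(train_threshold(train[:n]), test)) for n in sizes]

def stabilised(curve, tol=0.01):
    """Return the smaller size of the first consecutive pair whose accuracies differ by less than tol."""
    for (n_prev, a_prev), (_, a) in zip(curve, curve[1:]):
        if abs(a - a_prev) < tol:
            return n_prev
    return None

sizes = [10, 50, 100, 500, 1000, 2000]
curve = learning_curve(make_data(2000, seed=1), make_data(2000, seed=2), sizes)
for n, acc in curve:
    print(n, round(acc, 3))
print("accuracy stabilises from about", stabilised(curve), "examples")
```

If the accuracy is still climbing at the largest size, that suggests more data may help; if it has flattened, collecting more is unlikely to improve this algorithm.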
WHAT WE HAVE LEARNED:
• The 8 steps of data mining as well as the need for iteration in order to produce a satisfactory
model.
• The difference between estimation and classification
• When to do data mining and when not to
• Important issues when defining a problem
REVIEW QUESTIONS
1. Explain how you are going to decide whether a given problem is suitable for a data
mining solution.
Given a problem, I am going to check whether there is no good existing solution and the problem has the
following characteristics:
• Lots of data
• The problem is not well understood
• The problem can be characterised as an input-to-output relationship
• Existing models have strong and possibly erroneous assumptions
2. Suppose the data provided is the last promotional mail-out records which consist of
information about each of the 100000 customers (name, address, occupation, salary)
and whether each individual customer responded to the mail (i.e., an attribute indicating
“yes” or “no”). You are asked to produce a data mining solution, that is, a model
describing the characteristics of customers who are likely, as well as unlikely, to
respond to the promotional mail-out. The company could then use this model to target
customers who are likely to respond to the next promotional mail-out for the same
product.
Discuss the following issues:
• Is this problem suitable for a data mining solution?
Yes.
This problem is suitable for a data mining solution because there is no good existing solution, there is a lot
of data, and the problem can easily be characterised as an input-to-output relationship.
• Does the information above give us a classification or estimation problem? Justify your answer.
Classification problem: the model is to be used to predict whether or not a customer will respond to the
mail-out, i.e. there are two classes of customers.
• What are the inputs and output?
The inputs are the attributes: name, address, occupation, salary.
The output is the attribute indicating whether the customer responded, with the labels “yes” or “no”.
• What is the alternative to producing a model?
As the task is to select a subset of customers to send the promotional mail-out to, instead of building a model
in order to identify the customers, one can simply perform a random selection of customers from all the
available customers
• How will you use the data for training a model and evaluating the model?
In order to evaluate the trained models using test data, the given data set of 100000 customers can be split
into two subsets: one for training and one for testing. There is no need to use a more elaborate
evaluation method such as 10-fold cross-validation, as the data set is big enough and the portion reserved for
testing is unlikely to degrade the predictive accuracy of the trained model.
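A minimal sketch of such a two-way split is given below. The 30% test fraction, the fixed seed and the function name `holdout_split` are assumptions; the answer above only says that the data is split into a training and a testing subset.

```python
import random

def holdout_split(records, test_fraction=0.3, seed=42):
    """Shuffle a copy of the records and split it into (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Stand-in for the 100000 customer records (here just customer indices).
customers = list(range(100_000))
train, test = holdout_split(customers)
print(len(train), len(test))
```

Shuffling before the split avoids any ordering in the records (for example by region or sign-up date) leaking into one subset only.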
3. Let's say you are given a set of training data with 50% class “positive” and 50% class
“negative”, and you have explored several models and selected the best model. You can
now use the best model to do future prediction. Now, you are informed that the future data
you are going to get is likely to have the following class distribution: 90% class “positive”
and 10% class “negative”.
Would you go ahead to use the best model to do prediction for all future data? Provide a reason for your
answer. In the case that your answer is no, you shall also provide an alternative solution to do prediction for
the future data.
The training data is not representative of the testing data in terms of class distribution: one has a 1:1
ratio and the other has a 9:1 ratio. This is why one should not use a model built from a data set that is not
representative of the testing data to do prediction for future data.
If one does, it is likely to lead to poor performance, i.e. a high error rate.
On the 9:1 ratio testing data, the same model may well perform worse than a simple model that
predicts class “positive” regardless of the input; that simple model will have an accuracy of 90% on the 9:1 ratio
testing data.
One should be aware that some models allow one to adjust their outputs to match the class distribution of the
testing data. Only with this adjustment can a model trained on a different class distribution
be applied.
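One common way to make such an adjustment, shown here only as a sketch, is to re-weight the model's predicted probability by the ratio of the future class prior to the training class prior and then renormalise. This assumes the model outputs a reasonably calibrated probability for class “positive”; the function name `adjust_for_class_prior` is hypothetical, and the slides do not say which adjustment method they have in mind.

```python
def adjust_for_class_prior(p_positive, train_prior, future_prior):
    """Re-weight a predicted probability of class "positive" for a new class prior.

    p_positive   : probability of "positive" from the model trained on the old data
    train_prior  : share of "positive" in the training data (0.5 in the question)
    future_prior : expected share of "positive" in future data (0.9 in the question)
    """
    w_pos = p_positive * future_prior / train_prior
    w_neg = (1.0 - p_positive) * (1.0 - future_prior) / (1.0 - train_prior)
    return w_pos / (w_pos + w_neg)

# A model trained on 1:1 data is undecided about this example (p = 0.5) ...
adjusted = adjust_for_class_prior(0.5, train_prior=0.5, future_prior=0.9)
# ... but under the 9:1 future distribution "positive" becomes the likelier class.
label = "positive" if adjusted >= 0.5 else "negative"
print(round(adjusted, 3), label)
```

When the training and future priors are equal, the function returns the original probability unchanged.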
4. How does one decide whether to collect more data or not in a non-time series data
mining task?
When there is only a small amount of data to mine, one should collect more data.
When there is a lot of data, one should not collect more if the existing data is sufficient. Whether it is
sufficient can be checked by training on gradually larger portions of the data and seeing whether the accuracy
level has stabilised.
In both situations time is not a factor; the decision is based on the quantity of data the data
mining task requires.
The 8 Step Data Mining Process

  • 1. THE DATA MINING PROCESS LEARNING OBJECTIVES: • PROVIDE AN OVERVIEW OF THE EIGHT STEPS IN THE DATA MINING PROCESS. • IDENTIFY THE ISSUES INVOLVED IN DEFINING A DATA MINING PROBLEM. • DETERMINE WHEN TO USE AND NOT TO USE DATA MINING • EXPLAIN HOW TO CONDUCT AND EXPERIMENT TO DETERMINE WHETHER MORE DATA IS NEEDED By Obakeng Brian Pheelwane & Marc Berman – Group 14
  • 2. 8 STEP DATA MINING PROCESS • Defining the problem • Collecting data • Preparing data • Pre-processing • Selecting an algorithm and training parameters • Training and testing • Iterating to produce different models • Evaluating the final model - Defining the problem sets the objective of the whole data mining process. - The next three steps involve collecting the required data, then preparing and pre-processing it before a data mining technique can be applied. - Select an algorithm and perform training and testing to evaluate whether the model is good. - Because one does not know in advance how an algorithm will perform on a data set, one needs to try different models and compare them against each other. - Because of the previous step, many iterations of all the steps may be required to determine the optimal model. - The best model is selected based on its estimated accuracy. This model is then used for future predictions.
  • 3. 8 STEP DATA MINING PROCESS CONTINUED… Defining the problem: • One needs to determine which problems are suitable for data-driven modelling. • How does one evaluate the results? • Is it a classification or estimation problem? • What are the inputs and outputs required for solving the problem? When TO DO data mining: • When no good existing solution exists and the problem has the following characteristics: - Lots of data - The problem is not well understood - The problem can be characterised as an input-to-output relationship - Existing models have strong and possibly erroneous assumptions When NOT to do data mining: • When the problem: - Has a complete, closed-form mathematical solution. - Is well understood and has a good analytical or rule-based solution.
  • 4. 8 STEP DATA MINING PROCESS CONTINUED… How do you evaluate the results? • What level of accuracy would be considered successful? • How will you benchmark the performance of a developed solution? • What existing alternatives will you compare against? • What kind of data will be used to evaluate the various models? • What will the models be used for and how well do they support that purpose? Classification or Estimation? • Discrete outputs = classification problem. • Continuous outputs = estimation problem. • Borderline outputs = can be either, depending on the granularity of the outputs.
  • 5. 8 STEP DATA MINING PROCESS
  • 6. CLASSIFICATION VERSUS ESTIMATION Classification: In classification learning, the learning scheme is presented with a set of classified examples from which it is expected to learn a way of classifying unseen examples. In plain English, the output is one of a fixed set of discrete classes, in the simplest case two (yes or no, 1 or 0), and each example belongs to exactly one class, not both. The algorithm learns from the training data how to predict the class of future unseen data. Estimation: In estimation, the output is not a discrete value (i.e. 0 or 1) but a real number, for example any value between 0 and 1. So what if the value sits between two classes? For example, 0.5 could fall into either class; however, if the classes are defined as 0 to 0.49 and 0.5 to 1, then it falls into the latter class. When the model is first designed, a set of predefined classes is laid out so that no data point lies between two or more categories.
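The idea of laying out predefined classes so that a continuous estimate never sits between two categories can be sketched in a few lines of Python. This is only an illustration: the function name `to_class`, the class names "low" and "high", and the 0.5 threshold are assumptions made for the example, not part of the slides.

```python
def to_class(score: float, threshold: float = 0.5) -> str:
    """Map a continuous estimate in [0, 1] to one of two predefined classes."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    # Scores below the threshold go to the first class; the threshold itself
    # belongs to the second class, so no value can sit between the classes.
    return "low" if score < threshold else "high"


# Hypothetical estimates produced by some estimation model.
scores = [0.12, 0.49, 0.5, 0.93]
labels = [to_class(s) for s in scores]
print(labels)  # ['low', 'low', 'high', 'high']
```

Using a half-open boundary (strictly below the threshold versus at or above it) is one simple way to guarantee that every value falls into exactly one class.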
  • 7. WHAT ARE THE INPUTS AND OUTPUTS? An example: the inputs and outputs when classifying loan applications at a bank. Outputs: The outputs could be “high risk” or “low risk”. Inputs: The inputs may include current salary, outstanding liabilities, bank account balances, other income, and number of years employed.
  • 8. IMPORTANT ISSUES Causal and non-causal attributes: Causal attributes occur when one attribute causes another, meaning that it can be used to predict the other attribute or is included in its calculation. It is important to avoid using non-causal attributes, as they produce a model that does not represent the future data. Inputs: The inputs must contain enough information to generate the desired output. If the inputs contain insufficient information, the accuracy of the final model will inevitably decrease. Data set: The data chosen must be an accurate representation of the future data set. If it is not, predictions on future data will be inaccurate.
  • 9. HOW MUCH DATA IS ENOUGH Key: The amount of data required depends on the complexity of the problem as well as the amount of noise in the data. Each learning algorithm follows its own learning curve: the accuracy increases as the data size increases, but once the algorithm reaches its optimal performance, a further increase of the data set CANNOT improve performance. Figure 1: Algorithm A reaches optimal performance sooner than Algorithm B; for both, the accuracy cannot increase after a certain point. Note: One needs to experiment to determine when the algorithm reaches its optimal performance. This can be determined by: • Using small portions of the training set and gradually increasing the size until the accuracy level stabilises.
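The experiment described on this slide (train on gradually larger portions of the training set until accuracy stabilises) could look roughly like the sketch below. Everything here is an assumption made for illustration: the data is synthetic, the learner is a simple nearest-centroid classifier, and the names `make_data`, `learning_curve` and `stabilised_at` as well as the 0.01 tolerance are hypothetical.

```python
import random


def make_data(n, rng):
    """Synthetic data: class 0 centred at 0.0, class 1 at 2.0, Gaussian noise."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        data.append((rng.gauss(2.0 * label, 1.0), label))
    return data


def train_centroids(train):
    """Nearest-centroid 'training': the mean input value of each class."""
    sums, counts = {0: 0.0, 1: 0.0}, {0: 0, 1: 0}
    for x, label in train:
        sums[label] += x
        counts[label] += 1
    # Arbitrary default centre if a class is missing from a very small sample.
    return {c: (sums[c] / counts[c] if counts[c] else float(c)) for c in (0, 1)}


def accuracy(centroids, test):
    """Fraction of test points whose nearest centroid matches their label."""
    correct = 0
    for x, label in test:
        predicted = min(centroids, key=lambda c: abs(x - centroids[c]))
        correct += predicted == label
    return correct / len(test)


def learning_curve(train, test, sizes):
    """Accuracy on the test set after training on the first n examples."""
    return [(n, accuracy(train_centroids(train[:n]), test)) for n in sizes]


def stabilised_at(curve, tolerance=0.01):
    """First size whose accuracy differs from the next size's by < tolerance."""
    for (n_prev, acc_prev), (_, acc_next) in zip(curve, curve[1:]):
        if abs(acc_next - acc_prev) < tolerance:
            return n_prev
    return None


rng = random.Random(0)
train, test = make_data(5000, rng), make_data(2000, rng)
curve = learning_curve(train, test, [10, 50, 200, 1000, 5000])
for n, acc in curve:
    print(f"{n:>5} training examples -> accuracy {acc:.3f}")
print("accuracy stops changing noticeably at size:", stabilised_at(curve))
```

The stabilisation check is deliberately crude (it compares only neighbouring sizes); in practice one would probably repeat the experiment over several random samples before deciding that more data will not help.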
  • 10. WHAT WE HAVE LEARNED: • The 8 steps of data mining as well as the need for iteration in order to produce a satisfactory model. • The difference between estimation and classification • When to do data mining and when not to • Important issues when defining a problem
  • 11. REVIEW QUESTIONS 1. Explain how you are going to decide whether a given problem is suitable for a data mining solution. Answer: Given a problem, I am going to check whether there is no good existing solution and whether the problem has the following characteristics: • Lots of data • The problem is not well understood • The problem can be characterised as an input-to-output relationship • Existing models have strong and possibly erroneous assumptions
  • 12. 2. Suppose the data provided is the last promotional mail-out records, which consist of information about each of the 100000 customers (name, address, occupation, salary) and whether each individual customer responded to the mail (i.e., an attribute indicating “yes” or “no”). You are asked to produce a data mining solution, that is, a model describing the characteristics of customers who are likely, as well as unlikely, to respond to the promotional mail-out. The company could then use this model to target customers who are likely to respond to the next promotional mail-out for the same product. Discuss the following issues: • Is this problem suitable for a data mining solution? Yes. This problem is suitable for a data mining solution because there is no good existing solution, there is a lot of data, and the problem can easily be characterised as an input-to-output relationship.
  • 13. • Does the information above give us a classification or estimation problem? Justify your answer. It is a classification problem: the model is used to predict whether or not a customer will respond to the mail-out, i.e. there are two classes of customers. • What are the inputs and output? The inputs are the attributes name, address, occupation and salary. The output is the attribute indicating whether the customer responded, with the labels “yes” or “no”. • What is the alternative to producing a model? As the task is to select a subset of customers to send the promotional mail-out to, instead of building a model to identify the customers, one can simply perform a random selection from all the available customers. • How will you use the data for training a model and evaluating the model? In order to evaluate the trained models using test data, the given data set of 100000 customers can be split into two subsets: one for training and one for testing. There is no need to use a more elaborate evaluation method such as 10-fold cross-validation, as the data set is big enough that the portion reserved for testing is unlikely to degrade the predictive accuracy of the trained model.
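The hold-out evaluation and the random-selection alternative from the answer above can be sketched as follows. The 70/30 split, the seeds, the mail-out size of 1000 and the use of integer IDs as stand-ins for the 100000 customer records are all assumptions made for this example; the slides do not specify them.

```python
import random


def holdout_split(records, test_fraction=0.3, seed=0):
    """Shuffle the records and split them into (training, testing) subsets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in for the 100000 customer records (hypothetical integer IDs).
customer_ids = range(100_000)

train_ids, test_ids = holdout_split(customer_ids)
print(len(train_ids), len(test_ids))  # 70000 30000

# The alternative to a model: mail a random selection of customers.
random_mailing = random.Random(1).sample(customer_ids, 1000)
print(len(random_mailing))  # 1000
```

A trained model is only worth deploying if it picks responders better than this random selection does when both are measured on the held-out test subset.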
  • 14. 3. Let's say you are given a set of training data with 50% class “positive” and 50% class “negative”, and you have explored several models and selected the best model. You can now use the best model to do future prediction. Now you are informed that the future data you are going to get is likely to have the following class distribution: 90% class “positive” and 10% class “negative”. Would you go ahead and use the best model to do prediction for all future data? Provide a reason for your answer. If your answer is no, you should also provide an alternative solution for predicting the future data. Answer: The training data is not representative of the testing data in terms of class distribution; one has a 1:1 ratio and the other has a 9:1 ratio. This is why one should not use a model built from a data set that is not representative of the testing data to do prediction for future data. If one does, it is likely to lead to poor performance, i.e. a high error rate.
  • 15. The same model applied to 9:1 ratio testing data will perform a lot worse, as 80% of the time the model will predict class “positive” regardless of the input. By comparison, a simple model that always predicts “positive” will have an accuracy of 90% on the 9:1 ratio testing data. One should be aware that some models allow one to adjust their outputs to match the class distribution of the testing data. Only with this adjustment can a model trained on a different class distribution be applied. 4. How does one decide whether to collect more data or not in a non-time series data mining task? Answer: When there is only a small amount of data to mine, one should collect more data. When there is a lot of data, one should not collect more if the data is already sufficient for the task. Whether it is sufficient can be checked by training on gradually larger portions of the data and seeing whether the accuracy has stabilised. In both situations time is not a factor; everything is based on the quantity of data required for the data mining task.
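The answer to question 3 mentions adjusting a model's outputs to match the class distribution of the testing data. One common way to do this, sketched below, is to re-weight the predicted class probability by the ratio of the future class prior to the training class prior. This is an assumption about what the adjustment could look like, not the method from the slides: it presumes the model outputs a reasonably calibrated probability and that only the class proportions change between training and future data. The function names are hypothetical.

```python
def adjust_for_new_priors(p_positive, train_prior=0.5, future_prior=0.9):
    """Re-weight P(positive | x) from the training prior to the future prior."""
    pos = p_positive * future_prior / train_prior
    neg = (1.0 - p_positive) * (1.0 - future_prior) / (1.0 - train_prior)
    return pos / (pos + neg)


def always_majority_accuracy(positive_fraction):
    """Accuracy of a model that always predicts the more frequent class."""
    return max(positive_fraction, 1.0 - positive_fraction)


# A prediction that was a coin flip under 1:1 training data leans strongly
# towards "positive" once the 9:1 future class distribution is accounted for.
print(round(adjust_for_new_priors(0.5), 3))  # 0.9
print(round(adjust_for_new_priors(0.3), 3))  # 0.794

# Baseline to beat on 9:1 data: always predicting "positive".
print(always_majority_accuracy(0.9))  # 0.9
```

The baseline makes the stakes concrete: on 9:1 data any model scoring below 90% accuracy is doing worse than ignoring the inputs entirely.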

Editor's Notes

  • #3: The data mining process is a multi-step process that often requires several iterations in order to produce satisfactory results. Data mining has 8 steps, namely defining the problem, collecting data, preparing data, pre-processing, selecting an algorithm and training parameters, training and testing, iterating to produce different models, and evaluating the final model. The first step defines the objective that drives the whole data mining process. The next three steps involve collecting the required data, preparing and preprocessing the data