Decision Tree, Naive Bayes, Association Rule Mining, Support Vector Machine, K Nearest Neighbour, K Means Clustering, Random Forest (by Akanksha Bali)
1. Decision Tree, Naive Bayes, Association Rule Mining, Support Vector Machine, KNN, K-Means Clustering, Random Forest
Presented to: Prof. Vibhakar Mansotra, Dean of Mathematical Science, University of Jammu
Presented by: Akanksha Bali, Research Scholar, Batch 2019, University of Jammu
2. Contents
Decision Tree
Naive Bayes Classifier
Support Vector Machine
Association Rule Mining
Apriori Algorithm
K Nearest Neighbour
K means Clustering
Random forest
3. Decision Trees
A decision tree is a flowchart-like tree structure in which the data is repeatedly split according to a certain parameter.
Each internal node (decision node) denotes a test on an attribute.
Each branch represents an outcome of the test.
There are two main types of decision trees:
Classification trees (yes/no types): the outcome is a categorical decision variable, e.g. 'fit' or 'unfit'.
Regression trees (continuous data types): the decision or outcome variable is continuous, e.g. a number like 12.
4. Entropy and Information Gain
Entropy
Entropy, also called Shannon entropy and denoted by H(S) for a finite set S, is the measure of the amount of uncertainty or randomness in data:
H(S) = - Σ p(x) log2 p(x)
Information gain
Information gain (closely related to the Kullback-Leibler divergence), denoted by IG(S, A) for a set S, is the effective change in entropy after deciding on a particular attribute A. It measures the relative change in entropy with respect to the independent variables:
IG(S, A) = H(S) - H(S | A)
IG(S, A) = H(S) - Σ p(x) * H(x)
where IG(S, A) is the information gain from applying feature A, H(S) is the entropy of the entire set, and the second term is the weighted entropy after splitting on A, with p(x) the proportion of examples falling into branch x.
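To make these formulas concrete, here is a small Python sketch (my own illustration, not part of the slides) that computes entropy and information gain for a binary split; the counts below are the [9+, 5-] split on Humidity that appears on a later slide:

```python
from math import log2

def entropy(pos, neg):
    """H(S) = -sum p(x) log2 p(x) over the two class proportions."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:                      # 0 * log2(0) is treated as 0
            h -= p * log2(p)
    return h

def information_gain(parent, branches):
    """IG(S, A) = H(S) - sum_x p(x) * H(x) over the branches produced by attribute A."""
    n = sum(pos + neg for pos, neg in branches)
    weighted = sum((pos + neg) / n * entropy(pos, neg) for pos, neg in branches)
    return entropy(*parent) - weighted

print(round(entropy(9, 5), 3))                                # 0.94
print(round(information_gain((9, 5), [(3, 4), (6, 1)]), 3))   # Humidity split: 0.151
```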
5. Top-Down Induction of Decision Trees: ID3
The ID3 algorithm performs the following tasks recursively (a code sketch follows the list):
1. Create a root node for the tree.
2. If all examples are positive, return the leaf node 'positive'.
3. Else, if all examples are negative, return the leaf node 'negative'.
4. Calculate the entropy of the current state, H(S).
5. For each attribute, calculate the entropy with respect to the attribute 'x', denoted H(S, x).
6. Calculate the information gain IG(S, x) for each attribute.
7. Select the attribute which has the maximum value of IG(S, x).
8. Remove the attribute that offers the highest IG from the set of attributes.
9. Repeat until we run out of attributes, or the decision tree has all leaf nodes.
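As an illustration, the recursion above can be written compactly in Python. This is my own minimal ID3-style sketch for categorical attributes (it re-implements the entropy and information-gain formulas from the previous slide and does no pruning); the tiny data set at the bottom is hypothetical:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    n = len(labels)
    gain = entropy(labels)
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        gain -= len(subset) / n * entropy(subset)
    return gain

def id3(rows, labels, attrs):
    if len(set(labels)) == 1:                 # all examples positive or negative -> leaf
        return labels[0]
    if not attrs:                             # no attributes left -> majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))   # maximise IG(S, x)
    tree = {best: {}}
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        tree[best][value] = id3([rows[i] for i in idx],
                                [labels[i] for i in idx],
                                [a for a in attrs if a != best])  # remove the used attribute
    return tree

# Tiny usage example with two attributes (hypothetical data):
rows = [{"Outlook": "Sunny", "Wind": "Weak"}, {"Outlook": "Sunny", "Wind": "Strong"},
        {"Outlook": "Overcast", "Wind": "Weak"}, {"Outlook": "Rain", "Wind": "Strong"}]
labels = ["No", "No", "Yes", "No"]
print(id3(rows, labels, ["Outlook", "Wind"]))
```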
7. Selecting the Next Attribute
Splitting S = [9+, 5-] (E = 0.940) on Humidity:
High: [3+, 4-], E = 0.985; Normal: [6+, 1-], E = 0.592
Gain(S, Humidity) = 0.940 - (7/14)*0.985 - (7/14)*0.592 = 0.151
Splitting S = [9+, 5-] (E = 0.940) on Wind:
Weak: [6+, 2-], E = 0.811; Strong: [3+, 3-], E = 1.0
Gain(S, Wind) = 0.940 - (8/14)*0.811 - (6/14)*1.0 = 0.048
Humidity provides greater information gain than Wind, w.r.t. the target classification.
8. Selecting the Next Attribute
Splitting S = [9+, 5-] (E = 0.940) on Outlook:
Sunny: [2+, 3-], E = 0.971; Overcast: [4+, 0-], E = 0.0; Rain: [3+, 2-], E = 0.971
Gain(S, Outlook) = 0.940 - (5/14)*0.971 - (4/14)*0.0 - (5/14)*0.971 = 0.247
9. Selecting the Next Attribute
The information gain values for the 4 attributes are:
• Gain(S, Outlook) = 0.247
• Gain(S, Humidity) = 0.151
• Gain(S, Wind) = 0.048
• Gain(S, Temperature) = 0.029
where S denotes the collection of training examples.
12. Converting a Tree to Rules
The final tree splits on Outlook at the root (Sunny, Overcast, Rain); the Sunny branch tests Humidity (High -> No, Normal -> Yes), Overcast is a Yes leaf, and the Rain branch tests Wind (Strong -> No, Weak -> Yes). Each root-to-leaf path becomes a rule:
R1: If (Outlook=Sunny) ∧ (Humidity=High) Then PlayTennis=No
R2: If (Outlook=Sunny) ∧ (Humidity=Normal) Then PlayTennis=Yes
R3: If (Outlook=Overcast) Then PlayTennis=Yes
R4: If (Outlook=Rain) ∧ (Wind=Strong) Then PlayTennis=No
R5: If (Outlook=Rain) ∧ (Wind=Weak) Then PlayTennis=Yes
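As a small illustration (my own code, not part of the slides), the five extracted rules can be applied directly as a Python function:

```python
def play_tennis(outlook, humidity, wind):
    """Apply rules R1-R5 extracted from the decision tree."""
    if outlook == "Sunny":
        return "No" if humidity == "High" else "Yes"    # R1 / R2
    if outlook == "Overcast":
        return "Yes"                                    # R3
    if outlook == "Rain":
        return "No" if wind == "Strong" else "Yes"      # R4 / R5

print(play_tennis("Rain", "High", "Weak"))  # -> Yes (rule R5)
```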
14. Avoid Overfitting
Stop growing the tree when a split is not statistically significant.
Alternatively, grow the full tree, then post-prune it.
15. NAÏVE BAYES ALGORITHM
Bayesian classification is a supervised learning method as well as a statistical method for classification.
It can solve diagnostic and predictive problems.
It is named after Thomas Bayes (c. 1701-1761).
It works on the principle of conditional probability as given by Bayes' theorem.
16. Derivation
D: a set of tuples. Each tuple is an n-dimensional attribute vector X = (x1, x2, x3, ..., xn).
Let there be m classes: C1, C2, C3, ..., Cm.
Maximum a posteriori hypothesis:
P(Ci | X) = P(X | Ci) P(Ci) / P(X)   (Bayes' theorem)
The class Ci with the highest posterior P(Ci | X) is predicted; since P(X) is the same for every class, it is enough to maximise P(X | Ci) P(Ci).
17. Problem Statement
Consider the given data set. Apply the naive Bayes algorithm and predict which type of fruit it is if the fruit has the following properties:
Fruit = {yellow, sweet, long}

Fruit    Yellow   Sweet   Long   Total
Orange   350      450     0      650
Banana   400      300     350    400
Others   50       100     50     150
Total    800      850     400    1200
18. Problem
Step 1: Compute the prior probability of each class of fruit:
P(C=Orange) = 650/1200 = 0.54
P(C=Banana) = 400/1200 = 0.33
P(C=Others) = 150/1200 = 0.125
Step 2: Compute the probability of each piece of evidence:
P(X1=long) = 400/1200 = 0.33
P(X2=sweet) = 850/1200 = 0.708
P(X3=yellow) = 800/1200 = 0.66
Step 3: Compute the probability of each class given each single piece of evidence:
P(C=Orange | X1=long) = 0/400 = 0
P(C=Orange | X2=sweet) = 450/850 = 0.52
P(C=Orange | X3=yellow) = 350/800 = 0.43
P(C=Banana | X1=long) = 350/400 = 0.875
P(C=Banana | X2=sweet) = 300/850 = 0.35
P(C=Banana | X3=yellow) = 400/800 = 0.5
P(C=Others | X1=long) = 50/400 = 0.125
P(C=Others | X2=sweet) = 100/850 = 0.117
P(C=Others | X3=yellow) = 50/800 = 0.0625
19. Problem
Step 4: Convert the Step 3 values into class-conditional likelihoods using Bayes' theorem, e.g. for Orange:
P(yellow | Orange) = P(Orange | yellow) * P(yellow) / P(Orange) = (0.43 * 0.66) / 0.54 = 0.53
P(sweet | Orange) = (0.52 * 0.708) / 0.54 = 0.68
P(long | Orange) = 0
Doing the same for Banana gives 1.0, 0.75 and 0.875; for Others it gives 0.33, 0.66 and 0.33.
Step 5: For each class, multiply the prior by the likelihoods of all three features:
Score(Orange) = 0.54 * 0.53 * 0.68 * 0 = 0
Score(Banana) = 0.33 * 1.0 * 0.75 * 0.875 = 0.22
Score(Others) = 0.125 * 0.33 * 0.66 * 0.33 = 0.009
Step 6: Prediction: the type of fruit is Banana, the class with the highest score.
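A minimal, self-contained Python sketch (my own code, not from the slides) reproduces the same result by reading the likelihoods directly from the count table; the dictionary layout and function names are my own choices:

```python
from math import prod

# Counts from the table above: per class, how many fruits are yellow / sweet / long,
# plus the total number of fruits in that class.
counts = {                    # (yellow, sweet, long, total)
    "Orange": (350, 450,   0, 650),
    "Banana": (400, 300, 350, 400),
    "Others": ( 50, 100,  50, 150),
}
N = 1200  # total number of fruits

def score(cls, features=("yellow", "sweet", "long")):
    yellow, sweet, long_, total = counts[cls]
    prior = total / N
    likelihood = {"yellow": yellow / total, "sweet": sweet / total, "long": long_ / total}
    # Naive Bayes: the posterior is proportional to prior * product of likelihoods.
    return prior * prod(likelihood[f] for f in features)

scores = {cls: score(cls) for cls in counts}
print(scores)                        # Orange ~ 0.0, Banana ~ 0.22, Others ~ 0.009
print(max(scores, key=scores.get))   # -> Banana
```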
20. Association Rule Mining
Association rule learning is a rule-based machine learning method for discovering interesting relations between variables.
Using association rule learning, a supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.
22. Association Rule Mining
Important concepts of association rule mining:
The support supp(X) of an itemset X is defined as the proportion of transactions in the data set which contain the itemset. In the example database, the itemset {milk, bread, butter} has a support of 1/5 = 0.2 since it occurs in 20% of all transactions (1 out of 5 transactions).
The confidence of a rule is defined as conf(X => Y) = supp(X ∪ Y) / supp(X).
For example, the rule {butter, bread} => {milk} has a confidence of supp({butter, bread, milk}) / supp({butter, bread}) = 0.2 / 0.2 = 1 in the database, which means that the rule is correct for 100% of the transactions containing butter and bread (100% of the time a customer buys butter and bread, milk is bought as well).
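The two measures are easy to compute directly. Below is a short Python sketch (my own code); the five-transaction database is hypothetical and chosen so that the numbers match the example quoted above:

```python
# Hypothetical 5-transaction database chosen so the numbers match the text above.
transactions = [
    {"milk", "bread"},
    {"butter"},
    {"beer", "diapers"},
    {"milk", "bread", "butter"},
    {"bread"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """conf(lhs => rhs) = supp(lhs ∪ rhs) / supp(lhs)."""
    return support(set(lhs) | set(rhs)) / support(set(lhs))

print(support({"milk", "bread", "butter"}))       # 0.2
print(confidence({"butter", "bread"}, {"milk"}))  # 1.0
```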
23. APRIORI ALGORITHM
The name of the algorithm is based on the fact that the algorithm uses prior knowledge of frequent itemset properties.
Apriori employs an iterative approach known as a level-wise search, where k-itemsets are used to explore (k+1)-itemsets.
First, the set of frequent 1-itemsets is found by scanning the database to accumulate the count for each item and collecting those items that satisfy minimum support. The resulting set is denoted L1.
Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found.
Finding each Lk requires one full scan of the database.
24. Problem Statement
For the following transaction dataset, generate rules using the Apriori algorithm. Take support = 50% and confidence = 50%.

Transaction ID   Items Purchased
I1               A, B, C
I2               A, C
I3               A, D
I4               B, E, F
25. Problem Statement
Step 1: Create the table of 1-itemsets and calculate their frequency and support.

Items   Frequency   Support
{A}     3           3/4 = 75%
{B}     2           2/4 = 50%
{C}     2           2/4 = 50%
{D}     1           1/4 = 25%
{E}     1           1/4 = 25%
{F}     1           1/4 = 25%
26. Problem Statement
Step 2: Keep only the rows whose support is equal to or greater than 50%.

Items   Frequency   Support
{A}     3           3/4 = 75%
{B}     2           2/4 = 50%
{C}     2           2/4 = 50%
27. Problem Statement
Step 3: Create the table of 2-itemsets and calculate their frequency and support.

Items    Frequency   Support
{A,B}    1           1/4 = 25%
{A,C}    2           2/4 = 50%
{B,C}    1           1/4 = 25%
28. Problem Statement
Step 4: Keep only the rows whose support is equal to or greater than 50%, then formulate the final rules and calculate their confidence.

Items    Frequency   Support
{A,C}    2           2/4 = 50%

Association rule   Support   Confidence   Conf %
A -> C             2         2/3 = 0.66   66%
C -> A             2         2/2 = 1      100%
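The whole worked example fits in a few lines of Python. This is my own compact sketch of the level-wise idea (the candidate generation is simplified relative to textbook Apriori join-and-prune), run on the four transactions above:

```python
from itertools import combinations

transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
min_support, min_confidence = 0.5, 0.5

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

# Level-wise search: frequent 1-itemsets L1, then L2 built from L1, and so on.
items = sorted({i for t in transactions for i in t})
frequent, k_itemsets = [], [frozenset([i]) for i in items]
while k_itemsets:
    level = [s for s in k_itemsets if support(s) >= min_support]
    frequent.extend(level)
    # Candidate (k+1)-itemsets are unions of frequent k-itemsets (simplified generation).
    k_itemsets = list({a | b for a in level for b in level if len(a | b) == len(a) + 1})

# Generate rules X -> Y from every frequent itemset with at least two items.
for itemset in (s for s in frequent if len(s) > 1):
    for r in range(1, len(itemset)):
        for lhs in map(frozenset, combinations(itemset, r)):
            rhs = itemset - lhs
            conf = support(itemset) / support(lhs)
            if conf >= min_confidence:
                print(f"{set(lhs)} -> {set(rhs)}  support={support(itemset):.2f}  confidence={conf:.2f}")
```

Running this prints the two rules from the slide: {A} -> {C} with confidence 0.67 and {C} -> {A} with confidence 1.00.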
29. SUPPORT VECTOR MACHINE
A Support Vector Machine (SVM) is a supervised machine learning algorithm which can be used for both classification and regression challenges.
It is mostly used in classification problems.
We perform classification by finding the hyper-plane that differentiates the two classes very well.
32. Support Vector Machine
Pros:
It works really well with a clear margin of separation.
It is effective in high dimensional spaces.
It is effective in cases where the number of dimensions is greater than the number of samples.
It uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
Cons:
It doesn't perform well when we have a large data set, because the required training time is higher.
It also doesn't perform very well when the data set has more noise, i.e. the target classes are overlapping.
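The slides do not name a library; as an illustration, here is a small scikit-learn sketch (the library and the toy data are my own choices) that fits an SVM with an RBF kernel and classifies a new point:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, random_state=0)  # two separable clusters
clf = SVC(kernel="rbf", C=1.0)        # the kernel trick handles non-linearly separable data
clf.fit(X, y)

print(clf.support_vectors_.shape)     # only a subset of points (support vectors) define the boundary
print(clf.predict([[0.0, 2.0]]))      # class label for a new sample
```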
34. K Nearest Neighbour
• K-Nearest Neighbours is one of the most basic yet essential classification algorithms in machine learning. It belongs to the supervised learning domain and finds intense application in pattern recognition, data mining and intrusion detection.
• It was first described in the early 1950s.
• It gained popularity when increased computing power became available.
• It is used widely in the areas of pattern recognition and statistical estimation.
35. Closeness
The Euclidean distance between two points or tuples, say X1 = (x11, x12, ..., x1n) and X2 = (x21, x22, ..., x2n), is
d(X1, X2) = sqrt( Σ_{i=1..n} (x1i - x2i)^2 )
37. Example
• We have data from a questionnaire survey and objective testing, with two attributes (acid durability and strength), to classify whether a special paper tissue is good or not. Here are four training samples:

X1 = Acid Durability (seconds)   X2 = Strength (kg/square meter)   Y = Classification
7                                7                                 Bad
7                                4                                 Bad
3                                4                                 Good
1                                4                                 Good

Now the factory produces a new paper tissue that passes the laboratory test with X1 = 3 and X2 = 7. Guess the classification of this new tissue.
38. Step 1: Initialise and define k.
Let's say k = 3.
(Always choose k as an odd number when the number of classes is even, to avoid a tie in the class prediction.)
Step 2: Compute the distance between the input sample and each training sample.
The coordinates of the input sample are (3, 7). Instead of calculating the Euclidean distance, we calculate the squared Euclidean distance.

X1 = Acid Durability (seconds)   X2 = Strength (kg/square meter)   Squared Euclidean distance
7                                7                                 (7-3)^2 + (7-7)^2 = 16
7                                4                                 (7-3)^2 + (4-7)^2 = 25
3                                4                                 (3-3)^2 + (4-7)^2 = 9
1                                4                                 (1-3)^2 + (4-7)^2 = 13
39. Example
Step 3: Sort the distances and determine the nearest neighbours based on the k-th minimum distance.

X1 = Acid Durability   X2 = Strength   Squared Euclidean distance   Rank (minimum distance)   Included in 3 nearest neighbours?
7                      7               16                           3                         Yes
7                      4               25                           4                         No
3                      4               9                            1                         Yes
1                      4               13                           2                         Yes
40. Example
Step 4: Take the 3 nearest neighbours and gather the category Y of each.

X1 = Acid Durability   X2 = Strength   Squared Euclidean distance   Rank (minimum distance)   Included in 3 nearest neighbours?   Y = Category of the nearest neighbour
7                      7               16                           3                         Yes                                 Bad
7                      4               25                           4                         No                                  -
3                      4               9                            1                         Yes                                 Good
1                      4               13                           2                         Yes                                 Good
41. Example
Step 5: Apply a simple majority vote.
Use the simple majority of the categories of the nearest neighbours as the prediction value for the query instance.
We have 2 "good" and 1 "bad". Thus we conclude that the new paper tissue that passes the laboratory test with X1 = 3 and X2 = 7 is included in the "good" category.
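A minimal from-scratch k-NN sketch (my own code, not from the slides) reproduces the paper-tissue example with k = 3:

```python
from collections import Counter

train = [((7, 7), "Bad"), ((7, 4), "Bad"), ((3, 4), "Good"), ((1, 4), "Good")]

def knn_predict(query, k=3):
    # Squared Euclidean distance is enough here, since sqrt does not change the ranking.
    dist = lambda p: (p[0] - query[0]) ** 2 + (p[1] - query[1]) ** 2
    neighbours = sorted(train, key=lambda item: dist(item[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

print(knn_predict((3, 7)))  # -> Good (nearest neighbours at distances 9, 13, 16)
```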
42. K-Means Clustering
Contents
Introduction
Algorithm
Example
Application
43. K-Means Clustering Algorithm
Clustering: the process of grouping a set of objects into classes of similar objects.
Documents within a cluster should be similar.
Documents from different clusters should be dissimilar.
Clustering is the commonest form of unsupervised learning.
Unsupervised learning = learning from raw data, as opposed to supervised learning, where a classification of the examples is given.
In principle, the optimal partition is achieved by minimising the sum of squared distances from each point x_n to the "representative object" (centroid) m_k of its cluster, e.g. using the Euclidean distance:
Σ_{n=1}^{N} d(x_n, m_k)^2
45. A simple example showing the implementation of the k-means algorithm (using k = 2).
46. Step 1:
Initialisation: we randomly choose the following two centroids (k = 2) for the two clusters.
In this case the two centroids are m1 = (1.0, 1.0) and m2 = (5.0, 7.0).
47. Step 2:
Thus, we obtain two clusters containing {1, 2, 3} and {4, 5, 6, 7}.
Their new centroids are m1 = (1.8, 2.3) and m2 = (4.1, 5.4).
48. Step 3:
Now, using these centroids, we compute the Euclidean distance of each object, as shown in the table.
Therefore, the new clusters are {1, 2} and {3, 4, 5, 6, 7}.
The next centroids are m1 = (1.25, 1.5) and m2 = (3.9, 5.1).
49. Step 4:
The clusters obtained are {1, 2} and {3, 4, 5, 6, 7}.
Therefore, there is no change in the clusters.
Thus, the algorithm comes to a halt here and the final result consists of 2 clusters, {1, 2} and {3, 4, 5, 6, 7}.
50. Example
Consider the following data set consisting of the scores of two variables on each of seven individuals:

Subject   A     B
1         1.0   1.0
2         1.5   2.0
3         3.0   4.0
4         5.0   7.0
5         3.5   5.0
6         4.5   5.0
7         3.5   4.5
51. Example
This data set is to be grouped into two clusters. As a first step in finding a sensible initial partition, let the A and B values of the two individuals furthest apart (using the Euclidean distance measure) define the initial cluster means, giving:

          Individual   Mean vector (centroid)
Group 1   1            (1.0, 1.0)
Group 2   4            (5.0, 7.0)
52. Example
The remaining individuals are now examined in sequence and allocated to the cluster to which they are closest, in terms of Euclidean distance to the cluster mean. The mean vector is recalculated each time a new member is added.

        Cluster 1                               Cluster 2
Step    Individuals   Mean vector (centroid)    Individuals   Mean vector (centroid)
1       1             (1.0, 1.0)                4             (5.0, 7.0)
2       1, 2          (1.2, 1.5)                4             (5.0, 7.0)
3       1, 2, 3       (1.8, 2.3)                4             (5.0, 7.0)
4       1, 2, 3       (1.8, 2.3)                4, 5          (4.2, 6.0)
5       1, 2, 3       (1.8, 2.3)                4, 5, 6       (4.3, 5.7)
6       1, 2, 3       (1.8, 2.3)                4, 5, 6, 7    (4.1, 5.4)
53. Example
The initial partition has now changed, and at this stage the two clusters have the following characteristics:

            Individuals   Mean vector (centroid)
Cluster 1   1, 2, 3       (1.8, 2.3)
Cluster 2   4, 5, 6, 7    (4.1, 5.4)
54. Example
But we cannot yet be sure that each individual has been assigned to the right cluster. So we compare each individual's distance to its own cluster mean and to that of the opposite cluster.

Individual   Distance to mean (centroid) of Cluster 1   Distance to mean (centroid) of Cluster 2
1            1.5                                        5.4
2            0.4                                        4.3
3            2.1                                        1.8
4            5.7                                        1.8
5            3.2                                        0.7
6            3.8                                        0.6
7            2.8                                        1.1
55. Example
Only individual 3 is nearer to the mean of the opposite cluster (2.1 vs 1.8), so it is relocated to Cluster 2, giving the new partition:

            Individuals     Mean vector (centroid)
Cluster 1   1, 2            (1.3, 1.5)
Cluster 2   3, 4, 5, 6, 7   (3.9, 5.1)

The iterative relocation would now continue from this new partition until no more relocations occur. However, in this example each individual is now nearer its own cluster mean than that of the other cluster, and the iteration stops, choosing the latest partitioning as the final cluster solution.
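A minimal NumPy sketch of the standard batch assign-then-update loop (Lloyd's algorithm) is shown below; this is my own code, initialised with individuals 1 and 4 as in the example. Note that the slides update the mean after each individual is added (an online variant), while this sketch recomputes assignments in batches, but both end with clusters {1, 2} and {3, 4, 5, 6, 7} on this data:

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0],
              [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])
centroids = X[[0, 3]].copy()            # initial means: individuals 1 and 4

for _ in range(10):                     # Lloyd's algorithm: assign, then update
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)       # assign each point to its nearest centroid
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
    if np.allclose(new_centroids, centroids):
        break                           # no centroid moved: converged
    centroids = new_centroids

print(labels)      # expected: [0 0 1 1 1 1 1]  -> clusters {1, 2} and {3, 4, 5, 6, 7}
print(centroids)   # centroids approximately (1.25, 1.5) and (3.9, 5.1)
```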
56. Applications
Clustering helps marketers improve their customer base and work on target areas. It helps group people (according to different criteria such as willingness, purchasing power, etc.) based on their similarity in many ways related to the product under consideration.
Clustering helps in the identification of groups of houses on the basis of their value, type and geographical location.
Clustering is used to study earthquakes. Based on the areas hit by an earthquake in a region, clustering can help analyse the next probable location where an earthquake can occur.
57. Random Forest
Contents
Random Forest Introduction
Pseudocode
Prediction Pseudocode
Example
Random Forest vs Decision Tree
Advantages
Disadvantages
Application
58. Random Forest
The random forest algorithm is a supervised classification and regression algorithm.
It randomly creates a forest with several trees.
59. Random Forest Pseudocode
1. Randomly select "k" features from the total "m" features, where k << m.
2. Among the "k" features, calculate the node "d" using the best split point.
3. Split the node into daughter nodes using the best split.
4. Repeat steps 1 to 3 until "l" number of nodes has been reached.
5. Build the forest by repeating steps 1 to 4 "n" times to create "n" trees.
60. Prediction Pseudocode
To perform prediction, the trained random forest algorithm uses the pseudocode below:
1. Take the test features, use the rules of each randomly created decision tree to predict the outcome, and store the predicted outcome (target).
2. Calculate the votes for each predicted target.
3. Consider the highest-voted predicted target as the final prediction from the random forest algorithm.
61. Example
Day Outlook Humidity Wind Play
D1 Sunny High Weak Yes
D2 Sunny High Strong No
D3 Overcast High Weak Yes
D4 Rain High Weak Yes
D5 Rain Normal Weak Yes
D6 Rain Normal Strong No
D7 Overcast Normal Strong Yes
D8 Sunny High Weak No
D9 Sunny Normal Weak Yes
D10 Rain Normal Weak Yes
D11 Sunny Normal Strong Yes
D12 Overcast High Strong Yes
D13 Overcast Normal Weak Yes
D14 Rain High Strong No
62. Example
Will the game happen if the weather conditions are
Outlook = Rain, Humidity = High, Wind = Weak?
Play = ?
Step 1: Divide the data into smaller subsets.
Step 2: The subsets need not be distinct; some subsets may overlap.
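As an illustration, here is a scikit-learn sketch (the library and the one-hot encoding are my choices, not from the slides) that fits a small random forest on the 14-day table above and classifies the query Outlook=Rain, Humidity=High, Wind=Weak:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# The 14-day table from the example (D1..D14).
data = pd.DataFrame({
    "Outlook":  ["Sunny","Sunny","Overcast","Rain","Rain","Rain","Overcast",
                 "Sunny","Sunny","Rain","Sunny","Overcast","Overcast","Rain"],
    "Humidity": ["High","High","High","High","Normal","Normal","Normal",
                 "High","Normal","Normal","Normal","High","Normal","High"],
    "Wind":     ["Weak","Strong","Weak","Weak","Weak","Strong","Strong",
                 "Weak","Weak","Weak","Strong","Strong","Weak","Strong"],
    "Play":     ["Yes","No","Yes","Yes","Yes","No","Yes",
                 "No","Yes","Yes","Yes","Yes","Yes","No"],
})

X = pd.get_dummies(data.drop(columns="Play")).astype(int)   # one-hot encode the categorical features
y = data["Play"]

forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

query = (pd.get_dummies(pd.DataFrame([{"Outlook": "Rain", "Humidity": "High", "Wind": "Weak"}]))
         .reindex(columns=X.columns, fill_value=0))
print(forest.predict(query))   # majority vote of the 10 trees, e.g. ['Yes']
```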
64. Advantages
Random forest is considered a highly accurate and robust method.
It is much less prone to overfitting than a single decision tree.
The algorithm can be used for both classification and regression problems.
Random forests can also handle missing values.
You can get the relative feature importances, which helps in selecting the most contributing features for the classifier.
65. Disadvantages
It can take longer than expected to compute a large number of trees.
The model is more difficult to interpret than a single decision tree.
66. Random Forest vs Decision Trees
A random forest is a set of multiple decision trees.
Deep decision trees may suffer from overfitting, but a random forest reduces overfitting by building trees on random subsets.
Decision trees are computationally faster.
A random forest is difficult to interpret, while a decision tree is easily interpretable and can be converted to rules.