Which Machine Learning Algorithm to Use

What machine learning algorithm should I use?

• The answer to the question “What machine learning algorithm should I use?” is always “It depends.”

• It depends on the size, quality, and nature of the data.

• It depends on what you want to do with the answer.

• It depends on how the math of the algorithm was translated into instructions for the computer you are using. And it depends on how much time you have.

• Even the most experienced data scientists can’t tell which algorithm will perform best before trying them.

Flavors of machine learning

Supervised Learning

• Supervised learning algorithms make predictions based on a set of examples. For instance, historical stock prices can be used to hazard guesses at future prices. Each example used for training is labeled with the value of interest — in this case the stock price.

• A supervised learning algorithm looks for patterns in those value labels. It can use any information that might be relevant — the day of the week, the season, the company’s financial data, the type of industry, the presence of disruptive geopolitical events — and each algorithm looks for different types of patterns.

  • After the algorithm has found the best pattern it can, it uses that pattern to make predictions for unlabeled testing data — tomorrow’s prices.
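
As a minimal sketch of this workflow (using scikit-learn and invented toy numbers, not real market data), training on labeled examples and predicting on an unlabeled one looks like this:

```python
# A minimal supervised-learning sketch. The "prices" are invented toy data.
import numpy as np
from sklearn.linear_model import LinearRegression

# Each row is one labeled training example: [day_of_week, season] -> price.
X_train = np.array([[0, 1], [1, 1], [2, 2], [3, 2], [4, 3]])
y_train = np.array([10.0, 10.5, 11.2, 11.0, 12.1])  # the value labels

model = LinearRegression()
model.fit(X_train, y_train)       # look for patterns in the labeled data

X_tomorrow = np.array([[5, 3]])   # an unlabeled test point
print(model.predict(X_tomorrow))  # the model's guess at "tomorrow's" value
```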

Classification

• When the data are being used to predict a category, supervised learning is also called classification. This is the case when classifying an image as a picture of either a ‘cat’ or a ‘dog’.

• When there are only two choices, it’s called two-class or binomial classification.

  • When there are more categories, as e.g. when predicting the winner of the NCAA March Madness tournament, this problem is known as multi-class classification.
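
To make the distinction concrete, here is a hedged sketch with scikit-learn; the features and labels are invented, and logistic regression stands in for any classifier:

```python
# Two-class vs. multi-class classification through the same API.
from sklearn.linear_model import LogisticRegression

X = [[4.0, 30], [50.0, 300], [5.0, 25], [45.0, 280]]  # invented [weight_kg, height_cm]

# Two-class (binomial): every example is 'cat' or 'dog'.
y_two = ['cat', 'dog', 'cat', 'dog']
clf = LogisticRegression().fit(X, y_two)
print(clf.predict([[6.0, 28]]))   # most likely 'cat'

# Multi-class: same estimator, more than two possible labels.
y_multi = ['cat', 'dog', 'cat', 'horse']
clf_multi = LogisticRegression().fit(X, y_multi)
print(clf_multi.predict([[48.0, 290]]))
```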

Regression

  • When a value is being predicted, as with stock prices, supervised learning is called regression.

Anomaly Detection

• Sometimes the goal is to identify data points that are simply unusual. In fraud detection, for example, any highly unusual credit card spending patterns are suspect.

• The possible variations are so numerous and the training examples so few that it’s not feasible to learn what fraudulent activity looks like.

• The approach that anomaly detection takes is to simply learn what normal activity looks like (using a history of non-fraudulent transactions) and identify anything that is significantly different.
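
A minimal sketch of that approach, assuming scikit-learn and invented transaction data (IsolationForest is just one of several anomaly detectors that fit this “learn normal, flag the rest” pattern):

```python
# Learn what normal activity looks like, then flag anything very different.
import numpy as np
from sklearn.ensemble import IsolationForest

# History of non-fraudulent transactions: [amount, hour_of_day] (invented)
normal = np.array([[25, 12], [40, 18], [12, 9], [33, 14], [28, 11],
                   [45, 19], [18, 10], [38, 16], [22, 13], [30, 15]])

detector = IsolationForest(random_state=0).fit(normal)

new = np.array([[35, 14], [5000, 3]])  # the second point is wildly unusual
print(detector.predict(new))           # +1 = looks normal, -1 = anomaly
```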

Unsupervised Learning

• In unsupervised learning, data points have no labels associated with them. Instead, the goal of an unsupervised learning algorithm is to organize the data in some way or to describe its structure.

  • This can mean grouping it into clusters or finding different ways of looking at complex data so that it appears simpler or more organized.
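
For instance, a clustering sketch with scikit-learn’s k-means on invented two-dimensional points:

```python
# Unsupervised learning: no labels, just structure. k-means groups the
# points into clusters and summarizes each cluster with its center.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],   # one blob
              [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])  # another blob

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # which cluster each point was assigned to
print(kmeans.cluster_centers_)  # a simpler description of the data
```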

Reinforcement Learning

• In reinforcement learning, the algorithm gets to choose an action in response to each data point.

• The learning algorithm also receives a reward signal a short time later, indicating how good the decision was.

• Based on this, the algorithm modifies its strategy in order to achieve the highest reward.

• Reinforcement learning is common in robotics, where the set of sensor readings at one point in time is a data point, and the algorithm must choose the robot’s next action.

  • It is also a natural fit for Internet of Things applications.
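
A toy illustration of that loop, in plain Python: an epsilon-greedy “bandit” chooses among three actions, receives a reward, and updates its strategy. The reward probabilities are invented.

```python
# Reinforcement learning in miniature: act, observe a reward, adapt.
import random

true_reward_prob = [0.3, 0.7, 0.5]  # hidden quality of each action (invented)
estimates = [0.0, 0.0, 0.0]         # the algorithm's learned value of each action
counts = [0, 0, 0]
epsilon = 0.1                       # fraction of the time we explore at random

random.seed(0)
for step in range(1000):
    if random.random() < epsilon:
        action = random.randrange(3)                        # explore
    else:
        action = max(range(3), key=lambda a: estimates[a])  # exploit
    reward = 1.0 if random.random() < true_reward_prob[action] else 0.0
    counts[action] += 1
    # Incremental average: nudge the estimate toward the observed reward.
    estimates[action] += (reward - estimates[action]) / counts[action]

print(estimates)  # should roughly recover [0.3, 0.7, 0.5]
```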

Considerations When Choosing an Algorithm

Accuracy

• Getting the most accurate answer possible isn’t always necessary.

• Sometimes an approximation is adequate, depending on what you want to use it for. If that’s the case, you may be able to cut your processing time dramatically by sticking with more approximate methods.

• Another advantage of more approximate methods is that they naturally tend to avoid overfitting.

Training Time

• The number of minutes or hours necessary to train a model varies a great deal between algorithms.

• Training time is often closely tied to accuracy — one typically accompanies the other.

• In addition, some algorithms are more sensitive to the number of data points than others.

• When time is limited, it can drive the choice of algorithm, especially when the data set is large.
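
A rough timing sketch (scikit-learn, synthetic data; the exact numbers depend entirely on your machine) shows how much training time can differ on identical data:

```python
# Comparing training time for two algorithms on the same synthetic data.
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=10000, n_features=20, random_state=0)

for model in (LogisticRegression(max_iter=1000), GradientBoostingClassifier()):
    start = time.perf_counter()
    model.fit(X, y)
    print(type(model).__name__, f"{time.perf_counter() - start:.2f}s")
```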

Linearity

• Lots of machine learning algorithms make use of linearity. Linear classification algorithms assume that classes can be separated by a straight line (or its higher-dimensional analog).

• These include logistic regression and support vector machines (as implemented in Azure Machine Learning).

• Linear regression algorithms assume that data trends follow a straight line.

  • These assumptions aren’t bad for some problems, but on others they bring accuracy down.

Figure: Non-linear class boundary. Relying on a linear classification algorithm here would result in low accuracy.

Figure: Data with a non-linear trend. Using a linear regression method here would generate much larger errors than necessary.
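
The effect in the second figure is easy to reproduce. In this sketch (synthetic quadratic data), a straight-line fit leaves large errors that even a small decision tree avoids:

```python
# A linear model on data with a non-linear trend vs. a model that can bend.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 200).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(0, 0.1, 200)  # quadratic trend plus noise

linear = LinearRegression().fit(X, y)
tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)

print(mean_squared_error(y, linear.predict(X)))  # large errors
print(mean_squared_error(y, tree.predict(X)))    # far smaller
```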

Number of Parameters

• Parameters are the knobs a data scientist gets to turn when setting up an algorithm. They are numbers that affect the algorithm’s behavior, such as error tolerance or number of iterations, or options between variants of how the algorithm behaves.

• The training time and accuracy of the algorithm can sometimes be quite sensitive to getting just the right settings. Typically, algorithms with large numbers of parameters require the most trial and error to find a good combination.

• Parameter sweeping automatically tries all parameter combinations at whatever granularity you choose (see the sketch below). While this is a great way to make sure you’ve spanned the parameter space, the time required to train a model increases exponentially with the number of parameters.

• The upside is that having many parameters typically indicates that an algorithm has greater flexibility. It can often achieve very good accuracy, provided you can find the right combination of parameter settings.
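
Here is what a parameter sweep looks like with scikit-learn’s GridSearchCV (iris data, an assumed two-parameter grid); note how the grid already holds 3 × 3 = 9 candidates, each refit once per cross-validation fold:

```python
# Parameter sweeping: try every combination in the grid, keep the best.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={
        "n_estimators": [10, 50, 100],  # number of trees
        "max_depth": [2, 4, None],      # how finely each tree may subdivide
    },
    cv=3,  # each of the 9 candidates is trained 3 times
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```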

Number of Features

  • For certain types of data, the number of features can be very large compared to the number of data points. This is often the case with genetics or textual data.
  • The large number of features can bog down some learning algorithms, making training time unfeasibly long. Support Vector Machines are particularly well suited to this case (see below).

Special Cases

• Some learning algorithms make particular assumptions about the structure of the data or the desired results.

• If you can find one that fits your needs, it can give you more useful results, more accurate predictions, or faster training times.

Algorithm properties in the following tables

• ● shows excellent accuracy, fast training times, and the use of linearity

• ○ shows good accuracy and moderate training times

(The property tables themselves are not reproduced in this post.)
Logistic Regression

Although it confusingly includes ‘regression’ in the name, logistic regression is actually a powerful tool for two-class and multiclass classification. It’s fast and simple. The fact that it uses an ‘S’-shaped curve instead of a straight line makes it a natural fit for dividing data into groups. Logistic regression gives linear class boundaries, so when you use it, make sure a linear approximation is something you can live with.
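
A minimal example of that ‘S’-shaped curve in action (scikit-learn, invented one-feature data): the model outputs probabilities that slide smoothly from 0 to 1, while the class boundary itself stays linear.

```python
# Logistic regression: a linear score squeezed through a sigmoid.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[2.25]]))  # near the boundary: close to 50/50
print(clf.predict([[0.2], [4.5]]))  # confident 0 and 1 far from it
```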


Trees, Forests, and Jungles

• Decision forests (regression, two-class, and multiclass), decision jungles (two-class and multiclass), and boosted decision trees (regression and two-class) are all based on decision trees, a foundational machine learning concept.

  • There are many variants of decision trees, but they all do the same thing — subdivide the feature space into regions with mostly the same label. These can be regions of consistent category or of constant value, depending on whether you are doing classification or regression.

Avoiding Overfitting

• Because a feature space can be subdivided into arbitrarily small regions, it’s easy to imagine dividing it finely enough to have one data point per region. This is an extreme example of overfitting.

• In order to avoid this, a large set of trees is constructed, with special mathematical care taken that the trees are not correlated. The average of this “decision forest” is a tree that avoids overfitting (a sketch contrasting a single tree with a forest follows the list below).

• Decision forests can use a lot of memory. Decision jungles are a variant that consumes less memory at the expense of a slightly longer training time.

• Boosted decision trees avoid overfitting by limiting how many times they can subdivide and how few data points are allowed in each region.

• The algorithm constructs a sequence of trees, each of which learns to compensate for the error left by the tree before it. The result is a very accurate learner that tends to use a lot of memory. For the full technical description, check out Friedman’s original paper.

• Fast forest quantile regression is a variation of decision trees for the special case where you want to know not only the typical (median) value of the data within a region, but also its distribution in the form of quantiles.
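
The overfitting contrast is easy to see in a sketch (scikit-learn, synthetic noisy data): an unrestricted single tree memorizes the training set, while a forest of decorrelated trees generalizes better.

```python
# A single unrestricted tree vs. an averaged forest on noisy data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, flip_y=0.1,
                           random_state=0)  # flip_y adds label noise
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # no depth limit
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# The tree scores a perfect 1.0 on training data it memorized; the forest
# usually wins on the held-out test data.
print("tree:  ", tree.score(X_tr, y_tr), tree.score(X_te, y_te))
print("forest:", forest.score(X_tr, y_tr), forest.score(X_te, y_te))
```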

Neural Networks and Perceptrons

• Neural networks are brain-inspired learning algorithms covering multiclass, two-class, and regression problems. They come in an infinite variety, but the networks discussed here all take the form of directed acyclic graphs (DAGs).

• That means that input features are passed forward (never backward) through a sequence of layers before being turned into outputs. In each layer, inputs are weighted in various combinations, summed, and passed on to the next layer (see the sketch after this list).

• This combination of simple calculations results in the ability to learn sophisticated class boundaries and data trends, seemingly by magic. Many-layered networks of this sort perform the “deep learning” that fuels so much tech reporting and science fiction.

• This high performance doesn’t come for free, though. Neural networks can take a long time to train, particularly for large data sets with lots of features.

• They also have more parameters than most algorithms, which means that parameter sweeping expands the training time a great deal.

• And for those overachievers who wish to specify their own network structure, the possibilities are inexhaustible.
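
As a minimal sketch of such a layered DAG (scikit-learn’s MLPClassifier standing in for heavier deep-learning frameworks, on a synthetic two-moons data set):

```python
# A small feed-forward network: inputs are weighted, summed, and passed
# forward through two hidden layers before becoming outputs.
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=400, noise=0.2, random_state=0)  # curved boundary

net = MLPClassifier(hidden_layer_sizes=(16, 16),  # two hidden layers
                    max_iter=2000, random_state=0)
net.fit(X, y)
print(net.score(X, y))  # learns a boundary no straight line could draw
```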

Support Vector Machines

• Support vector machines (SVMs) find the boundary that separates classes by as wide a margin as possible. When the two classes can’t be cleanly separated, the algorithm finds the best boundary it can.

• Because it makes this linear approximation, an SVM is able to run fairly quickly.

• Where it really shines is with feature-intense data, like text or genomic data (as sketched below).

• In these cases SVMs are able to separate classes more quickly and with less overfitting than most other algorithms, in addition to requiring only a modest amount of memory.
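
A hedged sketch of that strength (scikit-learn, an invented four-document corpus): TF-IDF turns short texts into sparse, high-dimensional features, which a linear SVM separates quickly and with modest memory.

```python
# A linear SVM on feature-intense text data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["stock prices rose sharply", "quarterly earnings beat forecasts",
         "the team won the championship", "a thrilling overtime victory"]
labels = ["finance", "finance", "sports", "sports"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())  # sparse features -> linear margin
clf.fit(texts, labels)
print(clf.predict(["earnings and stock forecasts"]))  # most likely 'finance'
```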

Bayesian Methods

• Bayesian methods have a highly desirable quality: they avoid overfitting. They do this by making some assumptions beforehand about the likely distribution of the answer. Another byproduct of this approach is that they have very few parameters.

• There are Bayesian algorithms for both classification (the two-class Bayes point machine) and regression (Bayesian linear regression). Note that these assume that the data can be split or fit with a straight line.
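
A brief sketch with scikit-learn’s BayesianRidge (used here as an accessible stand-in for Bayesian linear regression; the data are synthetic): the prior over the weights discourages overfitting, and predictions come with an uncertainty estimate.

```python
# Bayesian linear regression: few knobs, built-in resistance to overfitting.
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.1, size=50)  # linear + noise

model = BayesianRidge().fit(X, y)
mean, std = model.predict(X[:1], return_std=True)  # prediction plus uncertainty
print(model.coef_)  # close to the true weights [1.5, -2.0, 0.5]
print(mean, std)
```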

This post covers the very basics of machine learning. It sticks to conventional machine learning and does not discuss modern approaches such as deep learning (CNNs, RNNs, LSTMs, and so on).



