Comparative study of clustering techniques and autoencoders
K-MEANS CLUSTER ANALYSIS
The k-means clustering algorithm is one of the simplest and most popular unsupervised machine learning algorithms. It is a feature-based clustering approach. The objective of k-means is to group the data points and discover the underlying pattern; to do this, it looks for a fixed number (k) of clusters in the dataset.
The number of centroids can be fixed in advance or found using the elbow method. The idea behind the elbow method is that the within-cluster distortion decreases as k increases; the point where the rate of decline drops most sharply is the elbow point, and the corresponding k is chosen. Each centroid is the mean of the data points assigned to its cluster.
The algorithm initializes the centroids and assigns each data point to its nearest centroid to form the clusters. It then recomputes each centroid as the mean of its assigned points and recalculates the distances. If every data point stays in the same cluster, the process stops; otherwise it continues iteratively.
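As a minimal sketch of this procedure, the snippet below uses scikit-learn's KMeans on a synthetic dataset (the library, data, and the range of k values are illustrative assumptions, not taken from this text). It records the distortion used by the elbow method and then fits the final model with the chosen k:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic, well-separated data (illustrative only).
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Elbow method: record the distortion (inertia, the sum of squared
# distances to the nearest centroid) for a range of k values.
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(k, round(km.inertia_, 1))  # look for the sharpest drop

# Fit the final model with the chosen k; labels_ holds each point's
# cluster and cluster_centers_ holds the mean of each cluster's points.
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(km.cluster_centers_)
```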
The algorithm is fast, easy to understand, and relatively efficient. It scales well to large datasets and is guaranteed to converge (though only to a local optimum). K-means gives the best results when the data points are distinct or well separated. Its main disadvantages are its dependence on the initial centroid values and its sensitivity to outliers, which can easily drag the centroids away from the true cluster centres.
The k-means algorithm is very popular and is used in a variety of applications such as market segmentation, document clustering, image segmentation, and image compression. When we undertake a cluster analysis, the goal is usually one of the following:
- Get a meaningful intuition of the structure of the data we’re dealing with.
- Cluster-then-predict, where different models are built for different subgroups when we believe there is wide variation in the behaviour of those subgroups. An example is clustering patients into subgroups and building a model for each subgroup to predict the risk of a heart attack, as sketched after this list.
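A minimal cluster-then-predict sketch, assuming scikit-learn and a synthetic stand-in for patient data (both assumptions for illustration only):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical patient features and heart-attack outcome labels.
X, y = make_classification(n_samples=600, n_features=8, random_state=0)

# Step 1: cluster the patients into subgroups.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Step 2: build one predictive model per subgroup.
models = {}
for c in range(3):
    mask = km.labels_ == c
    models[c] = LogisticRegression(max_iter=1000).fit(X[mask], y[mask])

# To score a new patient, route them to their cluster's model.
x_new = X[:1]
c_new = km.predict(x_new)[0]
print("risk estimate:", models[c_new].predict_proba(x_new)[0, 1])
```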
HIERARCHICAL CLUSTERING
Hierarchical clustering is a similarity-based clustering approach. The objective of hierarchical clustering is to group the data points that are closest in distance. There are two types of hierarchical clustering: divisive and agglomerative.
In the divisive, or top-down, clustering method we assign all of the observations to a single cluster and then partition the cluster into the two least similar clusters, proceeding recursively on each cluster until there is one cluster per observation. There is evidence that divisive algorithms produce more accurate hierarchies than agglomerative algorithms in some circumstances, but they are conceptually more complex.
In the agglomerative, or bottom-up, clustering method we assign each observation to its own cluster, then repeatedly compute the similarity (e.g., distance) between clusters and join the two most similar ones. First, calculate the distance between all the data points and store it in a distance matrix. Search the distance matrix for the two most similar objects, merge them, and compute the distance between this new cluster and all the other data points, updating the distance matrix. Continue until all the data points fall under a single cluster.
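The following sketch shows this merge loop using SciPy's hierarchical clustering routines on a small synthetic dataset (the library, linkage method, and data sizes are assumptions for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Small synthetic dataset; hierarchical clustering suits small datasets.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(5, 1, (10, 2))])

# Step 1: pairwise distances between all data points (the distance matrix).
D = pdist(X)

# Step 2: repeatedly merge the two closest clusters (agglomerative),
# updating the inter-cluster distances after each merge.
Z = linkage(D, method="average")

# Cut the resulting tree to obtain flat cluster labels.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```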
Hierarchical clustering works well for small datasets. Its major advantage is that the user does not need to know anything about the dataset in advance or specify parameters such as the number of clusters k. Once the dataset becomes significantly large, however, hierarchical clustering becomes slow due to computational difficulties and does not produce favourable results.
TRADITIONAL CLUSTERING APPROACH
The most traditional and popular clustering methods are k-means clustering and hierarchical clustering. This approach begins to break down when the dataset is a time series, because time series data has a unique characteristic: the current data points depend on the previous data points. In the case of gas turbines, for example, a machine learning model can be built from the historic data to identify the underlying pattern and then used to detect anomalies in new data points.
AUTOENCODER
An autoencoder is an artificial neural network used for unsupervised feature learning. It seeks to learn a compressed representation of an input. Autoencoders are an unsupervised learning method, although technically they are trained using supervised learning techniques, an arrangement referred to as self-supervised learning. They are typically trained as part of a broader model that attempts to recreate the input.
The design of the autoencoder model purposefully makes this challenging by restricting the architecture to a bottleneck at the midpoint of the model, from which the reconstruction of the input data is performed.
There are seven types of autoencoder. They are as follows:
- Denoising Autoencoder
- Sparse Autoencoder
- Deep Autoencoder
- Contractive Autoencoder
- Undercomplete Autoencoder
- Convolutional Autoencoder
- Variational Autoencoder
A simple autoencoder is a feed-forward, non-recurrent neural network with an input layer, an output layer, and one or more hidden layers. In an autoencoder, the output layer has the same number of nodes as the input layer. Autoencoders reconstruct their own input instead of predicting a target Y; the model thus learns a representation of the data itself.
Data denoising and dimensionality reduction for data visualisation are considered the two main practical applications of autoencoders.
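A minimal sketch of such a simple autoencoder, assuming Keras and illustrative layer sizes (a 32-feature input compressed through an 8-unit bottleneck; none of these specifics come from this text):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Toy data: 1000 samples of 32 features (illustrative only).
X = np.random.rand(1000, 32).astype("float32")

# Feed-forward autoencoder: the 8-unit hidden layer is the bottleneck,
# and the output layer has the same width as the input layer.
inputs = keras.Input(shape=(32,))
encoded = layers.Dense(8, activation="relu")(inputs)       # encoder
decoded = layers.Dense(32, activation="sigmoid")(encoded)  # decoder
autoencoder = keras.Model(inputs, decoded)

# The model reconstructs its own input, so X is both input and target.
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=5, batch_size=32, verbose=0)
```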
LSTM AUTOENCODER
A Long Short-Term Memory (LSTM) autoencoder is a type of recurrent neural network. It is an implementation of an autoencoder for sequence data using an encoder-decoder LSTM architecture.
For a given dataset of sequences, an encoder-decoder LSTM is configured to read the input sequence, encode it, decode it, and recreate it. The performance of the model is evaluated based on the model’s ability to recreate the input sequence.
Once the model achieves a desired level of performance recreating the sequence, the decoder part of the model may be removed, leaving just the encoder model. This model can then be used to encode input sequences to a fixed-length vector.
The resulting vectors can then be used in a variety of applications, not least as a compressed representation of the sequence as an input to another supervised learning model.
It is particularly suited to multivariate time series analysis.
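A minimal sketch of this encoder-decoder arrangement, assuming Keras and illustrative shapes (100 sequences of 10 timesteps with 3 features, encoded to a 16-dimensional vector; all assumptions for illustration):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Toy multivariate sequences: 100 sequences, 10 timesteps, 3 features.
X = np.random.rand(100, 10, 3).astype("float32")

# The encoder compresses each sequence to a fixed-length vector;
# RepeatVector feeds that vector to the decoder at every timestep.
inputs = keras.Input(shape=(10, 3))
encoded = layers.LSTM(16)(inputs)  # fixed-length encoding
repeated = layers.RepeatVector(10)(encoded)
decoded = layers.LSTM(16, return_sequences=True)(repeated)
outputs = layers.TimeDistributed(layers.Dense(3))(decoded)

# Train the model to recreate its own input sequences.
model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")
model.fit(X, X, epochs=5, batch_size=16, verbose=0)

# Once trained, keep only the encoder to produce fixed-length vectors.
encoder = keras.Model(inputs, encoded)
print(encoder.predict(X[:2]).shape)  # (2, 16)
```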
GMMVA AUTOENCODER
A Gaussian Mixture Model Variational Autoencoder (GMMVA) is a deep neural network. The model first initializes random weights for the network. The input data is fed into the network and encoded. A latent vector is generated that follows a unit Gaussian distribution. These latent vectors are fed into the decoder to generate a sample input. Density-based clustering is then performed on the different Gaussians formed in the latent vector space.
A Gaussian Mixture Model variational autoencoder (GMMVA) provides a way of learning the probability distribution p(x, z) relating an input x to its latent representation z. In particular, the encoder e maps an input x to a distribution on z. A typical encoder will output parameters (μ, σ) = e(x), representing the Gaussian distribution N(μ, σ); this distribution is used as our approximation for p(z|x).
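As a sketch of the variational encoder just described, the snippet below builds a standard single-Gaussian encoder in Keras (not the full mixture-model variant; the library, layer sizes, and latent dimension are assumptions). It outputs (μ, log σ²) and samples z via the reparameterization trick:

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Variational encoder: maps an input x to the parameters (mu, sigma)
# of a Gaussian approximating p(z|x). Sizes are illustrative.
inputs = keras.Input(shape=(32,))
h = layers.Dense(16, activation="relu")(inputs)
mu = layers.Dense(4)(h)       # mean of N(mu, sigma)
log_var = layers.Dense(4)(h)  # log-variance, for numerical stability

# Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
# so gradients can flow through the sampling step during training.
def sample(args):
    mu, log_var = args
    eps = tf.random.normal(tf.shape(mu))
    return mu + tf.exp(0.5 * log_var) * eps

z = layers.Lambda(sample)([mu, log_var])
encoder = keras.Model(inputs, [mu, log_var, z])

x = np.random.rand(5, 32).astype("float32")
mu_out, log_var_out, z_out = encoder.predict(x)
print(z_out.shape)  # (5, 4)
```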
Autoencoders improve the performance of the model, yield plausible filters, and build the model from the data rather than from pre-defined features, producing filters that may fit the data better.
The main disadvantage of the autoencoder is the additional computational time: building a model on a dataset containing multivariate features can take a long time.