Continuing with the previous topic of Machine Learning, this post takes you through another important category of Machine Learning: Unsupervised Machine Learning.
Unsupervised Machine Learning
Unsupervised learning is a type of machine learning used to draw inferences from datasets consisting of input data without labelled responses.
The most common unsupervised learning method is cluster analysis, which is used in exploratory data analysis to find hidden patterns or groupings in data. Clusters are modelled using a measure of similarity defined by metrics such as Euclidean or probabilistic distance.
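To make the idea of a similarity measure concrete, here is a minimal sketch of Euclidean distance in base R. The three customers and their values are made up for illustration; only `dist()` from the standard `stats` package is used.

```r
# Three hypothetical customers described by two spending features
# (values are made up for illustration).
x <- rbind(a = c(0, 0),
           b = c(3, 4),
           c = c(6, 8))

# dist() returns all pairwise distances; "euclidean" is the default method.
d <- dist(x, method = "euclidean")
print(d)

# The same distance computed by hand for customers a and b.
manual_ab <- sqrt(sum((x["a", ] - x["b", ])^2))
print(manual_ab)
```

Customers that are close under this measure are candidates for the same cluster; a different metric would generally produce different groupings.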
Common clustering algorithms are:
- Hierarchical clustering: builds a multilevel hierarchy of clusters by creating a cluster tree
- k-Means clustering: partitions data into k distinct clusters based on distance to the centroid of a cluster
- Gaussian mixture models: models clusters as a mixture of multivariate normal density components
- Self-organizing maps: uses neural networks that learn the topology and distribution of the data
- Hidden Markov models: uses observed data to recover the sequence of states
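As a rough illustration of the first two algorithms in the list, the sketch below runs hierarchical clustering and k-means on the same toy data using base R. The data is synthetic (two deliberately well-separated groups) and the choice of complete linkage is an assumption made for the example, not something the post prescribes.

```r
set.seed(42)
# Hypothetical toy data: two well-separated groups of 2-D points.
pts <- rbind(matrix(rnorm(40, mean = 0,  sd = 0.5), ncol = 2),
             matrix(rnorm(40, mean = 10, sd = 0.5), ncol = 2))

# Hierarchical clustering: build the cluster tree, then cut it into 2 groups.
tree <- hclust(dist(pts), method = "complete")
hc_labels <- cutree(tree, k = 2)

# k-means: partition the points into k = 2 clusters around centroids.
km <- kmeans(pts, centers = 2, nstart = 10)

# Cross-tabulate the two labelings to see how far they agree.
print(table(hierarchical = hc_labels, kmeans = km$cluster))
```

On data this clean both methods should recover the same two groups; on real data they can disagree, which is one reason to try more than one algorithm.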
Unsupervised learning methods are used in bioinformatics for sequence analysis and genetic clustering, in data mining for sequence and pattern mining, in medical imaging for image segmentation, and in computer vision for object recognition.
Application of Unsupervised Learning:
Let’s start with the most popular clustering algorithm, k-means. For the sake of understanding, we use the wholesale customer data; the data source link is given below:
Data Reference Link: https://archive.ics.uci.edu/ml/machine-learning-databases/00292/
The tool that we are going to use is RStudio-0.99.903, and the language is ‘R-3.0’.
We import the “Wholesale customer data.csv” file into RStudio and check the basic information about the data. The “Wholesale customer data” describes spending on different product categories sold in different regions through different channels. Remember figuring out shapes from ink blots? k-means is somewhat similar to this activity. You look at the shape and spread of the data to work out how many distinct clusters (populations) are present, and you also learn which cluster most of the data points belong to.
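The import step could look roughly like the sketch below. So that it runs without the real download, it first writes a small stand-in file; the rows are made up, and the column names are an assumption about the header of the real file, so check them against your copy.

```r
# Stand-in for "Wholesale customer data.csv": a few made-up rows written to a
# temporary file. Column names are assumed, values are not real data.
csv_path <- file.path(tempdir(), "Wholesale customer data.csv")
writeLines(c(
  "Channel,Region,Fresh,Milk,Grocery,Frozen,Detergents_Paper,Delicassen",
  "2,3,12000,9500,7500,200,2600,1300",
  "1,3,7000,1200,4200,6400,500,1800",
  "2,1,6300,8800,7700,2400,3500,7800",
  "1,2,13000,1100,4000,900,300,700"
), csv_path)

# Import the file and check the basic information about it.
wholesale <- read.csv(csv_path, header = TRUE)
str(wholesale)                  # column names and types
print(dim(wholesale))           # number of rows and columns
print(summary(wholesale$Milk))  # min, quartiles, median, mean, max
```

With the real file, replace `csv_path` with the location where you saved the download.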
Summary of the data: the summary gives a clear picture of each variable, such as its mean, median and quartiles, so that we have at least a basic idea of the data. The attributes are:
- FRESH: annual spending on fresh products (Continuous)
- MILK: annual spending on milk products (Continuous)
- GROCERY: annual spending on grocery products (Continuous)
- FROZEN: annual spending on frozen products (Continuous)
- DETERGENTS_PAPER: annual spending on detergents and paper products (Continuous)
- DELICATESSEN: annual spending on delicatessen products (Continuous)
- CHANNEL: customer sale Channel, either Horeca (Hotel/Restaurant/Cafe) or Retail (Nominal)
- REGION: customer sale Region, one of Lisbon, Oporto or Other (Nominal)
If we go through the summary report, there is clearly a big gap between the top customers and the rest in each category (e.g. Milk ranges from a minimum of 55 to a maximum of 73498). Normalizing or scaling the data won’t necessarily remove those outliers. A log transformation might help to deal with this kind of data. We could also remove those customers completely: from a business perspective, you don’t really need a clustering algorithm to identify what your top customers are buying. So, what we do is remove the top customers from every column.
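The difference between scaling and a log transformation can be seen on a small example. In the sketch below only the endpoints 55 and 73498 come from the Milk range quoted above; the values in between are made up.

```r
# Hypothetical right-skewed spending column. The endpoints echo the Milk range
# quoted in the text; the middle values are invented for illustration.
milk <- c(55, 900, 1500, 3600, 7200, 73498)

# Scaling to z-scores keeps the relative gaps, so the top customer stays far out.
z <- as.numeric(scale(milk))

# A log transformation compresses large values much more than small ones.
log_milk <- log(milk)

print(round(z, 2))
print(round(log_milk, 2))

# Ratio of the largest value to the median, before and after taking logs.
ratio_raw <- max(milk) / median(milk)
ratio_log <- max(log_milk) / median(log_milk)
print(c(raw = ratio_raw, log = ratio_log))
```

The top customer remains an extreme point after scaling but sits much closer to the rest after the log transformation, which is why the transformation is worth considering before clustering.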
Here we have the list of top customers, which we removed from the data using a user-defined R function, “top_customers”, because these customers may unduly influence the analysis.
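The post does not show the body of “top_customers”, so the following is only one plausible reconstruction. The signature, the default of five customers per column, and the synthetic data are all assumptions made for this sketch.

```r
# Hypothetical reconstruction: the original function is not shown in the post.
# Returns the row positions of the n largest values in each listed column.
top_customers <- function(data, cols, n = 5) {
  idx <- integer(0)
  for (col in cols) {
    # Row positions of the n largest values in this column.
    top_rows <- order(data[[col]], decreasing = TRUE)[seq_len(n)]
    idx <- union(idx, top_rows)
  }
  sort(idx)
}

set.seed(1)
# Made-up spending data standing in for the wholesale file.
spend <- data.frame(Milk    = round(rexp(50, rate = 1 / 5000)),
                    Grocery = round(rexp(50, rate = 1 / 8000)),
                    Frozen  = round(rexp(50, rate = 1 / 3000)))

to_drop <- top_customers(spend, cols = names(spend), n = 5)
trimmed <- spend[-to_drop, ]

print(to_drop)       # rows flagged as top customers in at least one column
print(dim(trimmed))  # remaining rows and columns
```

Because a customer can be in the top five of several columns, the number of removed rows is at most `n` times the number of columns and usually smaller.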
So, now our data is prepared and ready for k-means clustering.
Below are the clusters formed by the k-means clustering algorithm.
From the above clustering plot, we can see a relatively well-defined set of clusters. The k-means clustering algorithm has grouped the data into five different clusters with a clustering strength of 73.5 %.
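A five-center k-means fit can be sketched in base R as follows. The data here is synthetic, so the numbers will not match the post. The post does not define "clustering strength"; this sketch assumes it is the ratio of between-cluster to total sum of squares that `kmeans()` reports, which is only one plausible reading.

```r
set.seed(123)
# Made-up, log-normally distributed spending data standing in for the
# prepared wholesale data (the real file is not bundled here).
spend <- data.frame(Milk    = rlnorm(200, meanlog = 8,   sdlog = 1),
                    Grocery = rlnorm(200, meanlog = 8.5, sdlog = 1),
                    Frozen  = rlnorm(200, meanlog = 7.5, sdlog = 1))

# Fit k-means with five centers; nstart re-runs the algorithm from several
# random starts and keeps the best solution.
km <- kmeans(spend, centers = 5, nstart = 25)

print(table(km$cluster))  # cluster sizes
print(km$centers)         # centroid of each cluster

# Assumed meaning of "strength": between-cluster share of total sum of squares.
strength <- 100 * km$betweenss / km$totss
print(round(strength, 1))
```

A higher percentage means the clusters account for more of the spread in the data, but it always rises as more centers are added, so it should not be used on its own to pick the number of clusters.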
We can iterate further by changing the number of centers to improve the strength of the clusters. The above scree plot gives an idea of the most suitable number of centers for k-means clustering. Examining the scree plot, the most suitable number of centers is 5. In this way, we can segment the data using cluster analysis.
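A scree (elbow) plot of this kind can be produced along the following lines. The data is synthetic with three built-in groups, so the elbow here falls at 3 rather than 5, and the range of k values tried is an arbitrary choice for the sketch.

```r
set.seed(123)
# Made-up data with three well-separated groups of 2-D points.
pts <- rbind(matrix(rnorm(100, mean = 0),  ncol = 2),
             matrix(rnorm(100, mean = 6),  ncol = 2),
             matrix(rnorm(100, mean = 12), ncol = 2))

# Total within-cluster sum of squares for k = 1..8.
ks <- 1:8
wss <- sapply(ks, function(k) {
  kmeans(pts, centers = k, nstart = 20)$tot.withinss
})
print(round(wss, 1))

# Draw the scree plot only in an interactive session (e.g. RStudio);
# look for the k after which the curve flattens.
if (interactive()) {
  plot(ks, wss, type = "b",
       xlab = "Number of centers (k)",
       ylab = "Total within-cluster sum of squares")
}
```

The within-cluster sum of squares drops sharply until k reaches the number of real groups and only slowly afterwards; that bend is the "elbow" read off the plot.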
What else can we do with Unsupervised Machine Learning?
- In cancer research, to classify patients into subgroups according to their gene expression profiles. This can be useful for identifying the molecular profile of patients with a good or bad prognosis, as well as for understanding the disease.
- In marketing, for market segmentation, by identifying subgroups of customers with similar profiles who might be receptive to a particular form of advertising.
Clearly, Machine Learning is an incredibly powerful tool. In the coming years, it promises to help solve some of our most pressing everyday problems, as well as open up whole new worlds of opportunity.