Unsupervised Machine Learning

Continuing our Machine Learning series, this post takes you through another important category of Machine Learning: Unsupervised Machine Learning.

Unsupervised Machine Learning

Unsupervised learning is a type of machine learning that draws inferences from datasets consisting of input data without labeled responses.

The most common unsupervised learning method is cluster analysis, which is used in exploratory data analysis to find hidden patterns or groupings in data. The clusters are modeled using a measure of similarity, defined by metrics such as Euclidean or probabilistic distance.
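To make the idea of a similarity measure concrete, here is a minimal Python sketch of the Euclidean distance between two customers. The article's own analysis is done in R; the function name and the spending figures below are made up purely for illustration.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical (milk, grocery) spending for three customers.
customer_a = (3.0, 4.0)
customer_b = (0.0, 0.0)
customer_c = (3.0, 5.0)

print(euclidean_distance(customer_a, customer_b))  # 5.0
print(euclidean_distance(customer_a, customer_c))  # 1.0
```

A smaller distance means two customers are more similar, so customer_a would sooner be grouped with customer_c than with customer_b.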

Common clustering algorithms are:

  • Hierarchical clustering: builds a multilevel hierarchy of clusters by creating a cluster tree
  • k-Means clustering: partitions data into k distinct clusters based on the distance to the centroid of a cluster
  • Gaussian mixture models: model clusters as a mixture of multivariate normal density components
  • Self-organizing maps: use neural networks that learn the topology and distribution of the data
  • Hidden Markov models: use observed data to recover the sequence of states

Unsupervised learning methods are used in bioinformatics for sequence analysis and genetic clustering, in data mining for sequence and pattern mining, in medical imaging for image segmentation, and in computer vision for object recognition.

Application of Unsupervised Learning:

k-Means Clustering:

Let’s start with the most popular clustering algorithm, k-means. To illustrate it, we use the wholesale customer data; the data source link is given below:

Data Reference Link: 

The tool we use is RStudio 0.99.903, and the language is R 3.0.

We import “Wholesale customer data.csv” into RStudio and check the basic information about the data. The “Wholesale customer data” describes the different product categories sold in different regions via different channels. Remember figuring out shapes from ink blots? k-means is somewhat similar: you look at the shape and spread of the data to work out how many distinct clusters (populations) are present and which cluster most of the data points belong to.

[Image: sales data]

Summary of the data – The summary gives a clear picture of each column through its mean, median, quartiles, and so on, which gives us a first idea of how the data is distributed.

[Image: summary of the sales data]
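As a rough illustration of what such a summary contains, the Python sketch below computes the same kind of statistics for one column. The article itself uses R; the helper name is hypothetical, and only the minimum (55) and maximum (73498) of the Milk values come from the article, while the values in between are invented.

```python
import statistics

def summarize(values):
    """Return min, quartiles, mean and max for a list of numbers."""
    # method="inclusive" treats the data as the full population of interest.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "mean": statistics.mean(values),
        "q3": q3,
        "max": max(values),
    }

# Illustrative Milk spending; the middle three values are made up.
milk = [55, 1200, 3600, 7200, 73498]
for name, value in summarize(milk).items():
    print(f"{name}: {value}")
```

Note how the mean sits far above the median here: a single very large customer is enough to pull it up, which is the kind of skew the summary helps to spot.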

Attribute Information:

  • FRESH: annual spending on fresh products (Continuous)
  • MILK: annual spending on milk products (Continuous)
  • GROCERY: annual spending on grocery products (Continuous)
  • FROZEN: annual spending on frozen products (Continuous)
  • DETERGENTS_PAPER: annual spending on detergents and paper products (Continuous)
  • DELICATESSEN: annual spending on delicatessen products (Continuous)
  • CHANNEL: customer sale Channel – Horeca (Hotel/Restaurant/Cafe) or Retail channel (Nominal)
  • REGION: customer sale Region – Lisbon, Oporto or Other (Nominal)

Data Preparation:

The summary report shows a big gap between the top customers and the rest in each category (e.g. Milk ranges from a minimum of 55 to a maximum of 73498). Normalizing or scaling the data won’t necessarily remove those outliers. A log transformation might help with this kind of data. We could also remove those customers completely: from a business perspective, you don’t really need a clustering algorithm to identify what your top customers are buying. So we remove the top customers from every column.
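To see why a log transformation can help with this kind of spread, the Python sketch below compares the ratio of largest to smallest value before and after the transform. It is illustrative only and is not the approach the article ends up taking. The helper name and the middle three values are assumptions; 55 and 73498 are the Milk extremes quoted above.

```python
import math

def log_transform(values):
    """Apply log(1 + x): zero spending stays defined and large values are compressed."""
    return [math.log1p(v) for v in values]

# Illustrative Milk spending; only the first and last values come from the article.
milk = [55, 1200, 3600, 7200, 73498]
logged = log_transform(milk)

raw_ratio = max(milk) / min(milk)      # spread on the original scale
log_ratio = max(logged) / min(logged)  # spread after the transform
print(round(raw_ratio), round(log_ratio, 2))
```

The ordering of customers is unchanged, but the largest value is no longer more than a thousand times the smallest, so it dominates distance calculations far less.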

[Image: top customers]

Here we have the list of top customers, which we removed from the data using a user-defined R function, “top_customers”, because these customers may unduly influence the analysis.

[Image: customer sales data]
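The article does not show the body of its “top_customers” function, so the following Python sketch is only a guess at the general idea: find the rows that rank in the top n of any spending column and drop them. The function names, the value of n, and the sample rows are all hypothetical.

```python
def top_customer_indices(rows, columns, n):
    """Collect indices of rows that rank in the top n of any listed column."""
    to_remove = set()
    for col in columns:
        ranked = sorted(range(len(rows)), key=lambda i: rows[i][col], reverse=True)
        to_remove.update(ranked[:n])
    return to_remove

def remove_top_customers(rows, columns, n):
    """Return the rows that are not a top-n customer in any listed column."""
    drop = top_customer_indices(rows, columns, n)
    return [row for i, row in enumerate(rows) if i not in drop]

# Made-up customers for illustration.
customers = [
    {"Milk": 55, "Grocery": 900},
    {"Milk": 73498, "Grocery": 1200},
    {"Milk": 3600, "Grocery": 45000},
    {"Milk": 1200, "Grocery": 2500},
    {"Milk": 7200, "Grocery": 3100},
]
trimmed = remove_top_customers(customers, ["Milk", "Grocery"], n=1)
print(len(trimmed))  # 3
```

Using a set means a customer who tops several columns is only removed once, so the number of dropped rows can be smaller than n times the number of columns.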

The data is now prepared and ready for k-means clustering.

[Image: seed]

The clusters formed by the k-means clustering algorithm are shown below.

[Image: clustering plot]

From the above clustering plot, we can see a relatively well-defined set of clusters. The k-means algorithm has grouped the data into five clusters with a clustering strength of 73.5 % (presumably the ratio of between-cluster to total sum of squares that R’s kmeans output reports).
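Assuming the “clustering strength” quoted above is the ratio that R’s kmeans output prints as between_SS / total_SS, the Python sketch below shows how such a figure can be computed from points and their cluster labels. The function names and the four toy points are hypothetical and are not the article’s data.

```python
def sum_of_squares(points, center):
    """Sum of squared Euclidean distances from each point to center."""
    return sum(sum((x - c) ** 2 for x, c in zip(p, center)) for p in points)

def mean_point(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(dim) / len(points) for dim in zip(*points))

def between_over_total(points, labels):
    """Share of the total variation explained by the cluster assignment (0 to 1)."""
    total_ss = sum_of_squares(points, mean_point(points))
    within_ss = 0.0
    for label in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == label]
        within_ss += sum_of_squares(members, mean_point(members))
    return (total_ss - within_ss) / total_ss

# Toy example: two tight groups that are far apart.
points = [(1.0, 1.0), (1.0, 3.0), (9.0, 1.0), (9.0, 3.0)]
labels = [0, 0, 1, 1]
strength = between_over_total(points, labels)
print(f"{strength:.1%}")
```

A value close to 100 % means the points sit tightly around their own cluster centers relative to the overall spread; a value near 0 % means the clusters explain almost nothing.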

[Image: scree plot]

We can iterate further by changing the number of centers to improve the strength of the clusters. The scree plot above gives an idea of the most appropriate number of centers for k-means: such a plot typically shows the within-cluster variation against the number of centers, and the point where the curve flattens marks a reasonable choice. Here, the scree plot suggests 5 centers. In this way, we can segment the data using cluster analysis.
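The numbers behind a scree plot can be sketched as follows: run k-means for several values of k and record the total within-cluster sum of squares each time. This Python sketch uses a deliberately tiny one-dimensional k-means with a deterministic start. The function name, the initialization rule, and the toy spending values are all assumptions, so its result does not reproduce the article’s choice of 5.

```python
def kmeans_1d(values, k, iterations=50):
    """Tiny 1-D k-means; returns the total within-cluster sum of squares."""
    data = sorted(values)
    # Deterministic start: spread the initial centers evenly over the sorted data.
    centers = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda c: abs(v - centers[c]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous center.
        new_centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:  # centers stopped moving: converged
            break
        centers = new_centers
    return sum((v - centers[i]) ** 2 for i, c in enumerate(clusters) for v in c)

# Toy spending values with three obvious groups.
spend = [1, 2, 3, 20, 21, 22, 50, 51, 52]
wss = {k: kmeans_1d(spend, k) for k in range(1, 5)}
for k, value in wss.items():
    print(k, round(value, 1))
```

On this toy data the within-cluster sum of squares drops sharply up to k = 3 and barely changes afterwards, which is the flattening “elbow” one looks for on a scree plot.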

What else can we do with Unsupervised Machine Learning?

  • In cancer research, it can classify patients into subgroups according to their gene expression profiles. This can be useful for identifying the molecular profile of patients with a good or bad prognosis, as well as for understanding the disease.
  • In marketing, it can be used for market segmentation: identifying subgroups of customers with similar profiles who might be receptive to a particular form of advertising.

Conclusion:

Clearly, Machine Learning is an incredibly powerful tool. In the coming years, it promises to help solve some of our most pressing everyday problems, as well as open up whole new worlds of opportunity.

For more insights on Unsupervised Machine Learning feel free to contact us or email us at sales@bistasolutions.com.