Unsupervised Learning: Clustering Analysis with Mall Customers Data

Unsupervised learning is a machine learning technique in data science that is used to identify patterns in unlabeled data. In contrast to supervised learning, the algorithm is not shown a desired output for each input; instead, it has to detect structure in the unlabeled data on its own. A well-known method of unsupervised learning is clustering. Clustering entails the process of finding groups in a dataset without predefined labels, so that the members of each group share similar characteristics. Let’s illustrate this with a small dataset on ‘Mall Customers‘.

Mall Customers Data

Recall the scatterplot of the mall customers data from the publication ‘Data Visualization in RStudio‘. The plot displays the dispersion of mall customers’ spending scores for a given annual income. Customers with a high annual income do not necessarily have a high spending score; they could have a low spending score. As the varying shades of blue of the dots in the scatterplot indicate, only younger customers are among the top spending scores.

Mall Customers Dataset – Kaggle Data Science Community

In the publication ‘Data Visualization in RStudio‘, I applied the ggplot function to create the scatterplot with ‘income’ on the x-axis and ‘spending score’ on the y-axis. The mall customers dataset is a rather small dataset containing 200 unique observations. As the plot illustrates, there are 5 main groups of mall customers:

  • Mall customers with a low income and high spending score;

  • Mall customers with a high income and a high spending score;

  • Mall customers with a low income and a low spending score;

  • Mall customers with a high income and a low spending score;

  • Mall customers with a medium income and a medium spending score.

Another way to identify groups with similar characteristics is to perform a clustering analysis. I start the clustering analysis with importing the libraries into RStudio.

# import libraries
>library(dplyr)
>library(ggplot2)
>library(gridExtra)
>library(skmeans)

The dplyr library is used to manipulate the dataset. Annual income is the most relevant variable affecting a customer’s spending score. Therefore, I create a data frame with the variables from which I want to derive a pattern. The fourth and fifth columns of the ‘Mall Customers‘ dataset contain the annual income and spending score.

#Create a dataset with the variables annual income and spending score
>MCdata <- Mall_Customers[, 4:5]

Classification and Clustering: What is the difference?

In contrast to classification, clustering does not assign a data point to a predefined class. Classification algorithms predict categorical class labels: having learned from the characteristics of the input data, the computer can assign a class label to each additional item. For example, to which class does a new customer belong, based on its characteristics? Does the new customer belong to the category of high-loyalty customers or not?

K-Nearest Neighbors is a widely used classification method to determine the class of a data point. The dataset is split into training data and testing data; the testing data contains the new customers. The K-Nearest Neighbors algorithm labels a new data point by looking at labeled data points in similar instances: it searches the training data for the K observations closest to the unknown, new data point, or in this case the new customer. The class that is most common among these nearest neighbors determines the class of the new data point.
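As an illustration, a minimal K-Nearest Neighbors run with the class library (which ships with R) might look as follows. The training data, labels, and the new customer below are made up for this sketch; they are not from the mall customers dataset.

```r
library(class)   # ships with R; provides knn()

# Hypothetical labeled training data: (annual income, spending score)
train  <- data.frame(income = c(20, 25, 80, 85),
                     score  = c(80, 75, 15, 20))
labels <- factor(c("high_loyalty", "high_loyalty",
                   "low_loyalty",  "low_loyalty"))

# A new, unlabeled customer
new_customer <- data.frame(income = 22, score = 78)

# The k = 3 nearest labeled neighbors vote on the class
predicted <- knn(train, new_customer, cl = labels, k = 3)
predicted   # "high_loyalty": two of the three nearest neighbors are high-loyalty
```

Here the new customer sits close to the two low-income, high-spending training points, so the majority vote among its 3 nearest neighbors assigns it the high-loyalty class.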

Scree Plot

A scree plot is a single line that shows, for each number of clusters, or factors, how much variance remains within the clusters, here measured by the total Within Sum of Squares. This quantity decreases as more factors are added.
The optimal number of factors lies at the ‘elbow‘ of the line. In geology, the term scree refers to a slope of loose rock debris at the base of a cliff or a steep incline. Every additional factor added to the model beyond the ‘elbow‘, on the scree, contains little additional information. In the scree plot for the mall customers analysis, the optimal number of clusters is five.

Mall Customers Dataset – Kaggle Data Science Community

The values used to produce the scree plot, as in the figure above, are stored in an empty vector. There are 2 ways to create an empty vector: you can use the rep() function or the vector() function and assign the result to a vector name.

#Create an empty vector
>empty_vector_for_scree <- rep(0, 9)

The rep() function replicates the value x number of times. In this case, the empty vector contains 9 zeros.

#The result of the function when not assigning a vector name.
>rep(0,9)
[1] 0 0 0 0 0 0 0 0 0

Another way to create an empty vector is to use the vector() function, specifying a numeric vector of length 9.

#Create an empty vector
>empty_vector_for_scree <- vector("numeric", length = 9)

After creating the empty vector, the zeros in it are replaced by the values produced in a for loop. A for loop repeats an expression once for each value in a sequence.

In general, a for loop looks like this:
> for (variable in sequence) {
expression
}

The part between parentheses contains the variable and the sequence of values. In this example, the variable is set to i and the sequence ranges from 1 to 9: the numbers of clusters on the x-axis of the scree plot. The expression in the loop body computes the ‘Within Sum of Squares’. The loop is an iterative process, which means that it is repeated multiple times (in this case 9 times) to generate a sequence of outcomes.

>#Calculate the Within Sum of Squares for 1 to 9 clusters
>for (i in 1:9){
MCdataloop <- kmeans(MCdata, i)
empty_vector_for_scree[i] <- MCdataloop$tot.withinss
}
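Once the loop has filled the vector, the scree plot itself can be drawn with base R’s plot() function. The sketch below is self-contained: since the Kaggle dataset is not included here, it uses randomly generated stand-in data in place of MCdata, and a shorter vector name wss for the stored values.

```r
set.seed(123)                              # k-means uses random starting centroids

# Stand-in for Mall_Customers[, 4:5] (annual income, spending score)
MCdata <- data.frame(income = runif(200, 15, 140),
                     score  = runif(200, 1, 100))

wss <- rep(0, 9)                           # empty vector for the Within Sum of Squares
for (i in 1:9) {
  wss[i] <- kmeans(MCdata, i)$tot.withinss
}

# The 'elbow' of this line suggests the number of clusters
plot(1:9, wss, type = "b",
     xlab = "Number of Clusters", ylab = "Within Sum of Squares",
     main = "Scree Plot")
```

With type = "b" the plot draws both points and connecting line segments, which makes the elbow easier to spot.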

K-means

K-means is one of the most popular algorithms for clustering analysis. The main goal of the k-means algorithm is to find groups in data. Data points are grouped based on similar characteristics, and the resulting groups are called clusters. Each data point within a cluster has more in common with other data points within the same cluster than with data points from other clusters. The most representative point within a cluster is called a centroid. A centroid is the mean value of a cluster. A data point is considered to be in a particular cluster when its value is closer to that particular cluster’s centroid than to any other centroid. The output displayed in the figure below shows the cluster mean of annual income and spending score for each of the 5 clusters.
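This assignment rule can be illustrated with a few made-up centroids: a data point joins the cluster whose centroid is closest. All numbers below are hypothetical and only serve to show the distance calculation.

```r
# Three hypothetical centroids in (annual income, spending score) space
centroids <- matrix(c(25, 80,    # low income, high spending score
                      90, 85,    # high income, high spending score
                      55, 50),   # medium income, medium spending score
                    ncol = 2, byrow = TRUE)

new_point <- c(30, 75)           # a hypothetical new customer

# Euclidean distance from the new point to each centroid
distances <- sqrt(rowSums(sweep(centroids, 2, new_point)^2))
which.min(distances)             # the nearest centroid determines the cluster: 1
```

The new point lies closest to the first centroid, so it would be assigned to the low-income, high-spending cluster.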

#k-means clustering
>CustomerClusters <- kmeans(MCdata, 5)

Mall Customers Dataset – Kaggle Data Science Community

In addition, printing ‘CustomerClusters‘ displays the size of each cluster and the within sum of squares for each cluster. The within cluster sum of squares of a particular cluster is larger when the variability of its observations is higher. It is also affected by the number of observations: if more observations are added to a cluster, the within sum of squares becomes larger. The cluster sizes display the number of observations in each cluster.
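These components can also be inspected individually on the object returned by kmeans(). The field names below ($size, $withinss, $centers) are the actual components of a kmeans object; the data, however, is a stand-in, since the Kaggle dataset is not bundled here.

```r
set.seed(42)

# Stand-in for the mall customers income/spending-score columns:
# two well-separated artificial groups of 100 customers each
MCdata <- data.frame(income = c(rnorm(100, 30, 5), rnorm(100, 90, 5)),
                     score  = c(rnorm(100, 80, 5), rnorm(100, 20, 5)))

CustomerClusters <- kmeans(MCdata, 2)

CustomerClusters$size      # number of observations in each cluster
CustomerClusters$withinss  # within sum of squares per cluster
CustomerClusters$centers   # cluster means (the centroids)
```

For the real analysis with 5 clusters, the same three fields hold the sizes, within sum of squares, and centroids shown in the printed output above.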

Visualizing Cluster Analysis

The plot below exhibits the 5 clusters in the ‘Mall Customers‘ dataset, with each cluster separated by color. Cluster 2 and cluster 5 have a larger dispersion among the observations within the cluster and therefore a higher within sum of squares. Cluster 4 contains the most observations.

#Create a scatterplot with clusters
>ggplot(Mall_Customers, aes(x = Annual_Income, y = Spending_Score)) +
  geom_point(aes(color = as.factor(CustomerClusters$cluster))) +
  ggtitle("Mall Customers: Cluster Analysis") +
  xlab("Annual Income") +
  ylab("Spending Score") +
  scale_color_discrete(name = "Clusters", labels = c("one", "two", "three", "four", "five"))

Mall Customers Dataset – Kaggle Data Science Community