Species Segmentation with K-means Clustering

Sujay Desai

The Iris flower dataset is one of the most popular datasets in machine learning: https://en.wikipedia.org/wiki/Iris_flower_data_set

There are 4 features: sepal length, sepal width, petal length, and petal width.

Import the relevant libraries

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.cluster import KMeans
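
If the CSV files are not at hand, an equivalent feature matrix can be loaded from scikit-learn's bundled copy of the Iris data. This is just an optional sketch (it assumes scikit-learn >= 0.23 for as_frame=True); the rest of the notebook uses the CSV columns sepal_length, sepal_width, petal_length, and petal_width.

# Optional: load the same four features from scikit-learn's built-in Iris dataset
from sklearn.datasets import load_iris
iris = load_iris(as_frame=True)
data_alt = iris.data.rename(columns={
    'sepal length (cm)': 'sepal_length',
    'sepal width (cm)': 'sepal_width',
    'petal length (cm)': 'petal_length',
    'petal width (cm)': 'petal_width',
})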

Load the data

Load data from the csv file: 'iris-dataset.csv'.

In [3]:
# Load the data
data = pd.read_csv('iris-dataset.csv')
# Check the data
data
Out[3]:
sepal_length sepal_width petal_length petal_width
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
... ... ... ... ...
145 6.7 3.0 5.2 2.3
146 6.3 2.5 5.0 1.9
147 6.5 3.0 5.2 2.0
148 6.2 3.4 5.4 2.3
149 5.9 3.0 5.1 1.8

150 rows × 4 columns
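
Before clustering, a quick sanity check on the raw features (value ranges and missing entries) can be worthwhile; a minimal sketch using the DataFrame loaded above:

# Summary statistics for each feature (count, mean, std, min/max, quartiles)
data.describe()
# Count missing values per column (expected to be zero for this dataset)
data.isnull().sum()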

Plot the data

We will try to cluster the iris flowers by the shape of their sepal.

In [4]:
# Create a scatter plot based on two corresponding features (sepal_length and sepal_width; OR petal_length and petal_width)
plt.scatter(data['sepal_length'],data['sepal_width'])
# Name your axes
plt.xlabel('Length of sepal')
plt.ylabel('Width of sepal')
plt.show()
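
The comment in the cell above also mentions the petal pair; for reference, the analogous plot on the petal features is a one-line change (sketch only, output not shown here):

# Alternative view: the same scatter plot on the petal features
plt.scatter(data['petal_length'], data['petal_width'])
plt.xlabel('Length of petal')
plt.ylabel('Width of petal')
plt.show()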

Clustering (unscaled data)

Separate the original data into 2 clusters.

In [5]:
# create a variable which will contain the data for the clustering
x = data.copy()
# create a k-means object with 2 clusters
kmeans = KMeans(2)
# fit the data
kmeans.fit(x)
Out[5]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=2, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)
In [6]:
# create a copy of data, so we can see the clusters next to the original data
clusters = data.copy()
# predict the cluster for each observation
clusters['cluster_pred']=kmeans.fit_predict(x)
In [8]:
# create a scatter plot based on two corresponding features (sepal_length and sepal_width; OR petal_length and petal_width)
plt.scatter(clusters['sepal_length'], clusters['sepal_width'], c=clusters['cluster_pred'], cmap='rainbow')
plt.xlabel('Length of sepal')
plt.ylabel('Width of sepal')
Out[8]:
Text(0, 0.5, 'Width of sepal')
  • It seems as if the two clusters are split along the diagonal:
    • large sepal width, small sepal length
    • small sepal width, large sepal length
  • Even without standardizing the variables, the k-means algorithm did not cluster on the basis of a single variable; if it had, we would expect a roughly horizontal or vertical separation between the clusters. This makes sense, since sepal width and sepal length are on roughly the same scale of magnitude (a quick check of the feature spreads follows below).
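
As a quick, hedged check of that claim about scale, the spread of the two sepal features can be compared directly (this uses only the data already loaded above):

# Compare the spread of the two sepal features; similar standard deviations
# mean neither feature dominates the Euclidean distances used by k-means
data[['sepal_length', 'sepal_width']].std()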

Standardize the variables

Import and use the scale function from sklearn to standardize the data.

In [9]:
# import some preprocessing module
from sklearn import preprocessing

# scale the data for better results
x_scaled = preprocessing.scale(data)
x_scaled
Out[9]:
array([[-9.00681170e-01,  1.03205722e+00, -1.34127240e+00,
        -1.31297673e+00],
       [-1.14301691e+00, -1.24957601e-01, -1.34127240e+00,
        -1.31297673e+00],
       [-1.38535265e+00,  3.37848329e-01, -1.39813811e+00,
        -1.31297673e+00],
       [-1.50652052e+00,  1.06445364e-01, -1.28440670e+00,
        -1.31297673e+00],
       [-1.02184904e+00,  1.26346019e+00, -1.34127240e+00,
        -1.31297673e+00],
       [-5.37177559e-01,  1.95766909e+00, -1.17067529e+00,
        -1.05003079e+00],
       [-1.50652052e+00,  8.00654259e-01, -1.34127240e+00,
        -1.18150376e+00],
       [-1.02184904e+00,  8.00654259e-01, -1.28440670e+00,
        -1.31297673e+00],
       [-1.74885626e+00, -3.56360566e-01, -1.34127240e+00,
        -1.31297673e+00],
       [-1.14301691e+00,  1.06445364e-01, -1.28440670e+00,
        -1.44444970e+00],
       [-5.37177559e-01,  1.49486315e+00, -1.28440670e+00,
        -1.31297673e+00],
       [-1.26418478e+00,  8.00654259e-01, -1.22754100e+00,
        -1.31297673e+00],
       [-1.26418478e+00, -1.24957601e-01, -1.34127240e+00,
        -1.44444970e+00],
       [-1.87002413e+00, -1.24957601e-01, -1.51186952e+00,
        -1.44444970e+00],
       [-5.25060772e-02,  2.18907205e+00, -1.45500381e+00,
        -1.31297673e+00],
       [-1.73673948e-01,  3.11468391e+00, -1.28440670e+00,
        -1.05003079e+00],
       [-5.37177559e-01,  1.95766909e+00, -1.39813811e+00,
        -1.05003079e+00],
       [-9.00681170e-01,  1.03205722e+00, -1.34127240e+00,
        -1.18150376e+00],
       [-1.73673948e-01,  1.72626612e+00, -1.17067529e+00,
        -1.18150376e+00],
       [-9.00681170e-01,  1.72626612e+00, -1.28440670e+00,
        -1.18150376e+00],
       [-5.37177559e-01,  8.00654259e-01, -1.17067529e+00,
        -1.31297673e+00],
       [-9.00681170e-01,  1.49486315e+00, -1.28440670e+00,
        -1.05003079e+00],
       [-1.50652052e+00,  1.26346019e+00, -1.56873522e+00,
        -1.31297673e+00],
       [-9.00681170e-01,  5.69251294e-01, -1.17067529e+00,
        -9.18557817e-01],
       [-1.26418478e+00,  8.00654259e-01, -1.05694388e+00,
        -1.31297673e+00],
       [-1.02184904e+00, -1.24957601e-01, -1.22754100e+00,
        -1.31297673e+00],
       [-1.02184904e+00,  8.00654259e-01, -1.22754100e+00,
        -1.05003079e+00],
       [-7.79513300e-01,  1.03205722e+00, -1.28440670e+00,
        -1.31297673e+00],
       [-7.79513300e-01,  8.00654259e-01, -1.34127240e+00,
        -1.31297673e+00],
       [-1.38535265e+00,  3.37848329e-01, -1.22754100e+00,
        -1.31297673e+00],
       [-1.26418478e+00,  1.06445364e-01, -1.22754100e+00,
        -1.31297673e+00],
       [-5.37177559e-01,  8.00654259e-01, -1.28440670e+00,
        -1.05003079e+00],
       [-7.79513300e-01,  2.42047502e+00, -1.28440670e+00,
        -1.44444970e+00],
       [-4.16009689e-01,  2.65187798e+00, -1.34127240e+00,
        -1.31297673e+00],
       [-1.14301691e+00,  1.06445364e-01, -1.28440670e+00,
        -1.44444970e+00],
       [-1.02184904e+00,  3.37848329e-01, -1.45500381e+00,
        -1.31297673e+00],
       [-4.16009689e-01,  1.03205722e+00, -1.39813811e+00,
        -1.31297673e+00],
       [-1.14301691e+00,  1.06445364e-01, -1.28440670e+00,
        -1.44444970e+00],
       [-1.74885626e+00, -1.24957601e-01, -1.39813811e+00,
        -1.31297673e+00],
       [-9.00681170e-01,  8.00654259e-01, -1.28440670e+00,
        -1.31297673e+00],
       [-1.02184904e+00,  1.03205722e+00, -1.39813811e+00,
        -1.18150376e+00],
       [-1.62768839e+00, -1.74477836e+00, -1.39813811e+00,
        -1.18150376e+00],
       [-1.74885626e+00,  3.37848329e-01, -1.39813811e+00,
        -1.31297673e+00],
       [-1.02184904e+00,  1.03205722e+00, -1.22754100e+00,
        -7.87084847e-01],
       [-9.00681170e-01,  1.72626612e+00, -1.05694388e+00,
        -1.05003079e+00],
       [-1.26418478e+00, -1.24957601e-01, -1.34127240e+00,
        -1.18150376e+00],
       [-9.00681170e-01,  1.72626612e+00, -1.22754100e+00,
        -1.31297673e+00],
       [-1.50652052e+00,  3.37848329e-01, -1.34127240e+00,
        -1.31297673e+00],
       [-6.58345429e-01,  1.49486315e+00, -1.28440670e+00,
        -1.31297673e+00],
       [-1.02184904e+00,  5.69251294e-01, -1.34127240e+00,
        -1.31297673e+00],
       [ 1.40150837e+00,  3.37848329e-01,  5.35295827e-01,
         2.64698913e-01],
       [ 6.74501145e-01,  3.37848329e-01,  4.21564419e-01,
         3.96171883e-01],
       [ 1.28034050e+00,  1.06445364e-01,  6.49027235e-01,
         3.96171883e-01],
       [-4.16009689e-01, -1.74477836e+00,  1.37235899e-01,
         1.33225943e-01],
       [ 7.95669016e-01, -5.87763531e-01,  4.78430123e-01,
         3.96171883e-01],
       [-1.73673948e-01, -5.87763531e-01,  4.21564419e-01,
         1.33225943e-01],
       [ 5.53333275e-01,  5.69251294e-01,  5.35295827e-01,
         5.27644853e-01],
       [-1.14301691e+00, -1.51337539e+00, -2.60824029e-01,
        -2.61192967e-01],
       [ 9.16836886e-01, -3.56360566e-01,  4.78430123e-01,
         1.33225943e-01],
       [-7.79513300e-01, -8.19166497e-01,  8.03701950e-02,
         2.64698913e-01],
       [-1.02184904e+00, -2.43898725e+00, -1.47092621e-01,
        -2.61192967e-01],
       [ 6.86617933e-02, -1.24957601e-01,  2.50967307e-01,
         3.96171883e-01],
       [ 1.89829664e-01, -1.97618132e+00,  1.37235899e-01,
        -2.61192967e-01],
       [ 3.10997534e-01, -3.56360566e-01,  5.35295827e-01,
         2.64698913e-01],
       [-2.94841818e-01, -3.56360566e-01, -9.02269170e-02,
         1.33225943e-01],
       [ 1.03800476e+00,  1.06445364e-01,  3.64698715e-01,
         2.64698913e-01],
       [-2.94841818e-01, -1.24957601e-01,  4.21564419e-01,
         3.96171883e-01],
       [-5.25060772e-02, -8.19166497e-01,  1.94101603e-01,
        -2.61192967e-01],
       [ 4.32165405e-01, -1.97618132e+00,  4.21564419e-01,
         3.96171883e-01],
       [-2.94841818e-01, -1.28197243e+00,  8.03701950e-02,
        -1.29719997e-01],
       [ 6.86617933e-02,  3.37848329e-01,  5.92161531e-01,
         7.90590793e-01],
       [ 3.10997534e-01, -5.87763531e-01,  1.37235899e-01,
         1.33225943e-01],
       [ 5.53333275e-01, -1.28197243e+00,  6.49027235e-01,
         3.96171883e-01],
       [ 3.10997534e-01, -5.87763531e-01,  5.35295827e-01,
         1.75297293e-03],
       [ 6.74501145e-01, -3.56360566e-01,  3.07833011e-01,
         1.33225943e-01],
       [ 9.16836886e-01, -1.24957601e-01,  3.64698715e-01,
         2.64698913e-01],
       [ 1.15917263e+00, -5.87763531e-01,  5.92161531e-01,
         2.64698913e-01],
       [ 1.03800476e+00, -1.24957601e-01,  7.05892939e-01,
         6.59117823e-01],
       [ 1.89829664e-01, -3.56360566e-01,  4.21564419e-01,
         3.96171883e-01],
       [-1.73673948e-01, -1.05056946e+00, -1.47092621e-01,
        -2.61192967e-01],
       [-4.16009689e-01, -1.51337539e+00,  2.35044910e-02,
        -1.29719997e-01],
       [-4.16009689e-01, -1.51337539e+00, -3.33612130e-02,
        -2.61192967e-01],
       [-5.25060772e-02, -8.19166497e-01,  8.03701950e-02,
         1.75297293e-03],
       [ 1.89829664e-01, -8.19166497e-01,  7.62758643e-01,
         5.27644853e-01],
       [-5.37177559e-01, -1.24957601e-01,  4.21564419e-01,
         3.96171883e-01],
       [ 1.89829664e-01,  8.00654259e-01,  4.21564419e-01,
         5.27644853e-01],
       [ 1.03800476e+00,  1.06445364e-01,  5.35295827e-01,
         3.96171883e-01],
       [ 5.53333275e-01, -1.74477836e+00,  3.64698715e-01,
         1.33225943e-01],
       [-2.94841818e-01, -1.24957601e-01,  1.94101603e-01,
         1.33225943e-01],
       [-4.16009689e-01, -1.28197243e+00,  1.37235899e-01,
         1.33225943e-01],
       [-4.16009689e-01, -1.05056946e+00,  3.64698715e-01,
         1.75297293e-03],
       [ 3.10997534e-01, -1.24957601e-01,  4.78430123e-01,
         2.64698913e-01],
       [-5.25060772e-02, -1.05056946e+00,  1.37235899e-01,
         1.75297293e-03],
       [-1.02184904e+00, -1.74477836e+00, -2.60824029e-01,
        -2.61192967e-01],
       [-2.94841818e-01, -8.19166497e-01,  2.50967307e-01,
         1.33225943e-01],
       [-1.73673948e-01, -1.24957601e-01,  2.50967307e-01,
         1.75297293e-03],
       [-1.73673948e-01, -3.56360566e-01,  2.50967307e-01,
         1.33225943e-01],
       [ 4.32165405e-01, -3.56360566e-01,  3.07833011e-01,
         1.33225943e-01],
       [-9.00681170e-01, -1.28197243e+00, -4.31421141e-01,
        -1.29719997e-01],
       [-1.73673948e-01, -5.87763531e-01,  1.94101603e-01,
         1.33225943e-01],
       [ 5.53333275e-01,  5.69251294e-01,  1.27454998e+00,
         1.71090158e+00],
       [-5.25060772e-02, -8.19166497e-01,  7.62758643e-01,
         9.22063763e-01],
       [ 1.52267624e+00, -1.24957601e-01,  1.21768427e+00,
         1.18500970e+00],
       [ 5.53333275e-01, -3.56360566e-01,  1.04708716e+00,
         7.90590793e-01],
       [ 7.95669016e-01, -1.24957601e-01,  1.16081857e+00,
         1.31648267e+00],
       [ 2.12851559e+00, -1.24957601e-01,  1.61574420e+00,
         1.18500970e+00],
       [-1.14301691e+00, -1.28197243e+00,  4.21564419e-01,
         6.59117823e-01],
       [ 1.76501198e+00, -3.56360566e-01,  1.44514709e+00,
         7.90590793e-01],
       [ 1.03800476e+00, -1.28197243e+00,  1.16081857e+00,
         7.90590793e-01],
       [ 1.64384411e+00,  1.26346019e+00,  1.33141568e+00,
         1.71090158e+00],
       [ 7.95669016e-01,  3.37848329e-01,  7.62758643e-01,
         1.05353673e+00],
       [ 6.74501145e-01, -8.19166497e-01,  8.76490051e-01,
         9.22063763e-01],
       [ 1.15917263e+00, -1.24957601e-01,  9.90221459e-01,
         1.18500970e+00],
       [-1.73673948e-01, -1.28197243e+00,  7.05892939e-01,
         1.05353673e+00],
       [-5.25060772e-02, -5.87763531e-01,  7.62758643e-01,
         1.57942861e+00],
       [ 6.74501145e-01,  3.37848329e-01,  8.76490051e-01,
         1.44795564e+00],
       [ 7.95669016e-01, -1.24957601e-01,  9.90221459e-01,
         7.90590793e-01],
       [ 2.24968346e+00,  1.72626612e+00,  1.67260991e+00,
         1.31648267e+00],
       [ 2.24968346e+00, -1.05056946e+00,  1.78634131e+00,
         1.44795564e+00],
       [ 1.89829664e-01, -1.97618132e+00,  7.05892939e-01,
         3.96171883e-01],
       [ 1.28034050e+00,  3.37848329e-01,  1.10395287e+00,
         1.44795564e+00],
       [-2.94841818e-01, -5.87763531e-01,  6.49027235e-01,
         1.05353673e+00],
       [ 2.24968346e+00, -5.87763531e-01,  1.67260991e+00,
         1.05353673e+00],
       [ 5.53333275e-01, -8.19166497e-01,  6.49027235e-01,
         7.90590793e-01],
       [ 1.03800476e+00,  5.69251294e-01,  1.10395287e+00,
         1.18500970e+00],
       [ 1.64384411e+00,  3.37848329e-01,  1.27454998e+00,
         7.90590793e-01],
       [ 4.32165405e-01, -5.87763531e-01,  5.92161531e-01,
         7.90590793e-01],
       [ 3.10997534e-01, -1.24957601e-01,  6.49027235e-01,
         7.90590793e-01],
       [ 6.74501145e-01, -5.87763531e-01,  1.04708716e+00,
         1.18500970e+00],
       [ 1.64384411e+00, -1.24957601e-01,  1.16081857e+00,
         5.27644853e-01],
       [ 1.88617985e+00, -5.87763531e-01,  1.33141568e+00,
         9.22063763e-01],
       [ 2.49201920e+00,  1.72626612e+00,  1.50201279e+00,
         1.05353673e+00],
       [ 6.74501145e-01, -5.87763531e-01,  1.04708716e+00,
         1.31648267e+00],
       [ 5.53333275e-01, -5.87763531e-01,  7.62758643e-01,
         3.96171883e-01],
       [ 3.10997534e-01, -1.05056946e+00,  1.04708716e+00,
         2.64698913e-01],
       [ 2.24968346e+00, -1.24957601e-01,  1.33141568e+00,
         1.44795564e+00],
       [ 5.53333275e-01,  8.00654259e-01,  1.04708716e+00,
         1.57942861e+00],
       [ 6.74501145e-01,  1.06445364e-01,  9.90221459e-01,
         7.90590793e-01],
       [ 1.89829664e-01, -1.24957601e-01,  5.92161531e-01,
         7.90590793e-01],
       [ 1.28034050e+00,  1.06445364e-01,  9.33355755e-01,
         1.18500970e+00],
       [ 1.03800476e+00,  1.06445364e-01,  1.04708716e+00,
         1.57942861e+00],
       [ 1.28034050e+00,  1.06445364e-01,  7.62758643e-01,
         1.44795564e+00],
       [-5.25060772e-02, -8.19166497e-01,  7.62758643e-01,
         9.22063763e-01],
       [ 1.15917263e+00,  3.37848329e-01,  1.21768427e+00,
         1.44795564e+00],
       [ 1.03800476e+00,  5.69251294e-01,  1.10395287e+00,
         1.71090158e+00],
       [ 1.03800476e+00, -1.24957601e-01,  8.19624347e-01,
         1.44795564e+00],
       [ 5.53333275e-01, -1.28197243e+00,  7.05892939e-01,
         9.22063763e-01],
       [ 7.95669016e-01, -1.24957601e-01,  8.19624347e-01,
         1.05353673e+00],
       [ 4.32165405e-01,  8.00654259e-01,  9.33355755e-01,
         1.44795564e+00],
       [ 6.86617933e-02, -1.24957601e-01,  7.62758643e-01,
         7.90590793e-01]])
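
An equivalent, and arguably more idiomatic, way to standardize is StandardScaler, which also stores the fitted mean and scale so that new observations can be transformed consistently. A minimal sketch (the rest of the notebook keeps using x_scaled from preprocessing.scale):

# Equivalent standardization via StandardScaler (keeps mean_ and scale_ for reuse)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_scaled_alt = scaler.fit_transform(data)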

Clustering (scaled data)

In [10]:
# create a k-means object with 2 clusters
kmeans_scaled = KMeans(2)
# fit the data
kmeans_scaled.fit(x_scaled)
Out[10]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=2, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)
In [11]:
# create a copy of data, so we can see the clusters next to the original data
clusters_scaled = data.copy()
# predict the cluster for each observation
clusters_scaled['cluster_pred']=kmeans_scaled.fit_predict(x_scaled)
In [21]:
# create a scatter plot based on two corresponding features (sepal_length and sepal_width; OR petal_length and petal_width)
plt.scatter(clusters_scaled['sepal_length'], clusters_scaled['sepal_width'], c=clusters_scaled['cluster_pred'], cmap='rainbow')
plt.xlabel('Length of sepal')
plt.ylabel('Width of sepal')
Out[21]:
Text(0, 0.5, 'Width of sepal')
  • The resulting clusters appear to be overlapping slightly less but overall very similar to those using the unstandardized data, which makes sense, since the variables were pretty much on the same scale of magnitude anyway.

Take Advantage of the Elbow Method

We don't know in advance how many clusters we should use, so we let the elbow method guide the choice.

WCSS

In [13]:
wcss = []
# 'cl_num' is the upper bound on the number of clusters we compute the WCSS for
# (the loop below evaluates k = 1 to cl_num - 1)
cl_num = 10
for i in range(1, cl_num):
    kmeans = KMeans(i)
    kmeans.fit(x_scaled)
    wcss_iter = kmeans.inertia_
    wcss.append(wcss_iter)
wcss
Out[13]:
[600.0,
 223.73200573676345,
 140.96581663074699,
 114.42970777082235,
 91.06677122728536,
 81.86484622750882,
 71.94075751907857,
 62.597593357969686,
 54.88995998878074]

The Elbow Method

In [14]:
number_clusters = range(1,cl_num)
plt.plot(number_clusters, wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('Within-cluster Sum of Squares')
Out[14]:
Text(0, 0.5, 'Within-cluster Sum of Squares')

Based on the Elbow Curve, we'll plot several graphs with the numbers of clusters we believe would best fit the data.

Understanding the Elbow Curve

Based on the Elbow Curve, 2, 3 or 5 clusters seem the most likely.
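
As a complementary check that the elbow plot alone does not provide, the silhouette score rewards tight, well-separated clusters. A hedged sketch (the silhouette is undefined for a single cluster, so the loop starts at 2):

# Silhouette score for k = 2 .. cl_num - 1 as a second opinion on the number of clusters
from sklearn.metrics import silhouette_score
for k in range(2, cl_num):
    labels = KMeans(k).fit_predict(x_scaled)
    print(k, silhouette_score(x_scaled, labels))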

2 clusters

Start by separating the standardized data into 2 clusters.

In [15]:
kmeans_2 = KMeans(2)
kmeans_2.fit(x_scaled)
Out[15]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=2, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)

Construct a scatter plot of the original data using the standardized clusters.

Remember that we are plotting the non-standardized values of the sepal length and width for clearer interpretability.

Otherwise, interpreting the standardized values on each axis would be confusing.

In [16]:
clusters_2 = x.copy()
clusters_2['cluster_pred']=kmeans_2.fit_predict(x_scaled)
In [22]:
plt.scatter(clusters_2['sepal_length'], clusters_2['sepal_width'], c=clusters_2['cluster_pred'], cmap='rainbow')
plt.xlabel('Length of sepal')
plt.ylabel('Width of sepal')
Out[22]:
Text(0, 0.5, 'Width of sepal')

3 Clusters

Redo the same for 3 and 5 clusters.

In [18]:
kmeans_3 = KMeans(3)
kmeans_3.fit(x_scaled)
Out[18]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)
In [19]:
clusters_3 = x.copy()
clusters_3['cluster_pred']=kmeans_3.fit_predict(x_scaled)
In [23]:
plt.scatter(clusters_3['sepal_length'], clusters_3['sepal_width'], c=clusters_3['cluster_pred'], cmap='rainbow')
plt.xlabel('Length of sepal')
plt.ylabel('Width of sepal')
Out[23]:
Text(0, 0.5, 'Width of sepal')

5 Clusters

In [25]:
kmeans_5 = KMeans(5)
kmeans_5.fit(x_scaled)
Out[25]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=5, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)
In [26]:
clusters_5 = x.copy()
clusters_5['cluster_pred']=kmeans_5.fit_predict(x_scaled)
In [27]:
plt.scatter(clusters_5['sepal_length'], clusters_5['sepal_width'], c=clusters_5['cluster_pred'], cmap='rainbow')
plt.xlabel('Length of sepal')
plt.ylabel('Width of sepal')
Out[27]:
Text(0, 0.5, 'Width of sepal')
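
Since the 2-, 3- and 5-cluster cells above differ only in the value of k, the same plots can also be produced in a single loop; a sketch using the variables already defined:

# Fit and plot each candidate solution in one loop instead of repeating cells
for k in (2, 3, 5):
    preds = KMeans(k).fit_predict(x_scaled)
    plt.scatter(x['sepal_length'], x['sepal_width'], c=preds, cmap='rainbow')
    plt.title(str(k) + ' clusters')
    plt.xlabel('Length of sepal')
    plt.ylabel('Width of sepal')
    plt.show()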

Compare solutions to the original iris dataset

The original (full) iris data is located in iris-with-answers.csv.

The 2-cluster solution seemed good, but in reality the iris dataset contains 3 species, i.e. a 3-cluster solution. Clustering therefore cannot be trusted blindly: a solution with x clusters can look convincing, while the real data actually contains more (or fewer) groups.

In [30]:
real_data = pd.read_csv('iris-with-answers.csv')
In [31]:
real_data['species'].unique()
Out[31]:
array(['setosa', 'versicolor', 'virginica'], dtype=object)
In [32]:
# Use the map function to encode the species names as numbers: setosa -> 0, versicolor -> 1, virginica -> 2
real_data['species'] = real_data['species'].map({'setosa':0, 'versicolor':1 , 'virginica':2})
In [33]:
real_data.head()
Out[33]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 0
1 4.9 3.0 1.4 0.2 0
2 4.7 3.2 1.3 0.2 0
3 4.6 3.1 1.5 0.2 0
4 5.0 3.6 1.4 0.2 0
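
With the true species codes available, a cross-tabulation gives a quick numeric view of how well the 3-cluster solution lines up with the species (a sketch; the cluster numbering is arbitrary, so only the pattern of the table matters):

# Cross-tabulate predicted clusters against the true species labels (0/1/2)
pd.crosstab(clusters_3['cluster_pred'], real_data['species'])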

Scatter plots (which we will use for comparison)

'Real data'

Looking at the first graph, the true species are much more intertwined in sepal space than we imagined (and than our clustering solutions suggested).

In [34]:
plt.scatter(real_data['sepal_length'], real_data['sepal_width'], c=real_data['species'], cmap='rainbow')
Out[34]:
<matplotlib.collections.PathCollection at 0x1a183b9f50>

Examining the other scatter plot (petal length vs petal width), we see that the features which actually distinguish the species are the petals, NOT the sepals!

Note that 'real data' is the data observed in the real world (biological data)

In [35]:
plt.scatter(real_data['petal_length'], real_data['petal_width'], c=real_data['species'], cmap='rainbow')
Out[35]:
<matplotlib.collections.PathCollection at 0x1a183b9450>
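
If the petals really carry the species signal, clustering on the two petal features alone should come much closer to the true species split; a quick sketch (reusing the preprocessing module imported earlier):

# Cluster on the (scaled) petal features only and plot the result
petals_scaled = preprocessing.scale(data[['petal_length', 'petal_width']])
petal_preds = KMeans(3).fit_predict(petals_scaled)
plt.scatter(data['petal_length'], data['petal_width'], c=petal_preds, cmap='rainbow')
plt.xlabel('Length of petal')
plt.ylabel('Width of petal')
plt.show()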

Our clustering solution data

It seems that our solution takes into account mainly the sepal features

In [36]:
plt.scatter(clusters_3['sepal_length'], clusters_3['sepal_width'], c=clusters_3['cluster_pred'], cmap='rainbow')
Out[36]:
<matplotlib.collections.PathCollection at 0x1a184f2f50>

Instead of the petals...

In [37]:
plt.scatter(clusters_3['petal_length'], clusters_3['petal_width'], c=clusters_3['cluster_pred'], cmap='rainbow')
Out[37]:
<matplotlib.collections.PathCollection at 0x1a184fe190>

Conclusion

Since the actual number of clusters is 3, we can see that:

  • the Elbow Method is imperfect (we might have opted for 2 or even 4 clusters)
  • k-means is most useful when we already know the number of clusters in advance - in this case, 3
  • biology cannot always be quantified, or at least not quantified well with k-means; other methods might be much better at that
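
For a single-number summary of how well a clustering matches the true species, independent of how the cluster labels are numbered, the adjusted Rand index is one option; a short sketch:

# Adjusted Rand index: 1.0 means perfect agreement with the species labels, near 0 means chance level
from sklearn.metrics import adjusted_rand_score
adjusted_rand_score(real_data['species'], clusters_3['cluster_pred'])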