The Iris flower dataset is one of the most popular datasets in machine learning: https://en.wikipedia.org/wiki/Iris_flower_data_set
There are 4 features: sepal length, sepal width, petal length, and petal width.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.cluster import KMeans
Load the data from the csv file: 'iris-dataset.csv'.
# Load the data
data = pd.read_csv('iris-dataset.csv')
# Check the data
data
We will try to cluster the iris flowers by the shape of their sepal.
# Create a scatter plot based on two corresponding features (sepal_length and sepal_width; OR petal_length and petal_width)
plt.scatter(data['sepal_length'],data['sepal_width'])
# Name your axes
plt.xlabel('Length of sepal')
plt.ylabel('Width of sepal')
plt.show()
Separate the original data into 2 clusters.
# create a variable which will contain the data for the clustering
x = data.copy()
# create a k-means object with 2 clusters
kmeans = KMeans(2)
# fit the data
kmeans.fit(x)
# create a copy of data, so we can see the clusters next to the original data
clusters = data.copy()
# predict the cluster for each observation
clusters['cluster_pred'] = kmeans.fit_predict(x)
# create a scatter plot based on two corresponding features (sepal_length and sepal_width; OR petal_length and petal_width)
plt.scatter(clusters['sepal_length'], clusters['sepal_width'], c=clusters['cluster_pred'], cmap='rainbow')
plt.xlabel('Length of sepal')
plt.ylabel('Width of sepal')
plt.show()
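Two side notes on the code above: fit_predict refits the model from scratch, so the labels could equally be taken from the kmeans object we already fitted via its labels_ attribute; and because k-means starts from random centroids, fixing random_state makes runs reproducible. A minimal sketch:
# for reproducible runs, fix the random seed when creating the model
kmeans = KMeans(2, random_state=42)
kmeans.fit(x)
# equivalent to fit_predict on the same data: reuse the labels from the fitted model
clusters['cluster_pred'] = kmeans.labels_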
Import and use the scale function from sklearn to standardize the data.
# import the preprocessing module from sklearn
from sklearn import preprocessing
# scale the data for better results
x_scaled = preprocessing.scale(data)
x_scaled
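preprocessing.scale returns a plain NumPy array, which is why x_scaled no longer has column names. An equivalent that performs the same standardization (zero mean, unit variance per feature) and can later be reused on new data is StandardScaler; a minimal sketch:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# fit_transform learns each column's mean and std, then standardizes; same values as preprocessing.scale(data)
x_scaled_alt = scaler.fit_transform(data)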
# create a k-means object with 2 clusters
kmeans_scaled = KMeans(2)
# fit the data
kmeans_scaled.fit(x_scaled)
# create a copy of data, so we can see the clusters next to the original data
clusters_scaled = data.copy()
# predict the cluster for each observation
clusters_scaled['cluster_pred'] = kmeans_scaled.fit_predict(x_scaled)
# create a scatter plot based on two corresponding features (sepal_length and sepal_width; OR petal_length and petal_width)
plt.scatter(clusters_scaled['sepal_length'], clusters_scaled['sepal_width'], c=clusters_scaled['cluster_pred'], cmap='rainbow')
plt.xlabel('Length of sepal')
plt.ylabel('Width of sepal')
plt.show()
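To see how much standardization changed the assignments, we can cross-tabulate the two labelings. Keep in mind the cluster IDs themselves are arbitrary, so only the pattern of the table matters:
# rows: clusters on the raw data, columns: clusters on the standardized data
pd.crosstab(clusters['cluster_pred'], clusters_scaled['cluster_pred'])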
Since we don't know how many clusters we should have, we will use the elbow method: run k-means for a range of cluster counts and record the within-cluster sum of squares (WCSS) for each.
wcss = []
# 'cl_num' is a variable that keeps track of the highest number of clusters we want to try the WCSS method for.
cl_num = 10
for i in range(1, cl_num + 1):
    kmeans = KMeans(i)
    kmeans.fit(x_scaled)
    wcss_iter = kmeans.inertia_
    wcss.append(wcss_iter)
wcss
number_clusters = range(1, cl_num + 1)
plt.plot(number_clusters, wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('Within-cluster Sum of Squares')
plt.show()
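The elbow can also be checked numerically: look at how much each extra cluster reduces the WCSS, and stop where the drops level off. A rough sketch using the wcss list we just built:
# percentage drop in WCSS when going from k to k+1 clusters
for k, (a, b) in enumerate(zip(wcss, wcss[1:]), start=1):
    print(f'{k} -> {k + 1} clusters: WCSS falls by {100 * (a - b) / a:.1f}%')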
Based on the Elbow Curve, we'll plot several graphs with the numbers of clusters we believe would best fit the data.
Based on the Elbow Curve, 2, 3, or 5 clusters seem the most likely.
Start by separating the standardized data into 2 clusters.
kmeans_2 = KMeans(2)
kmeans_2.fit(x_scaled)
Construct a scatter plot of the original data using the standardized clusters.
Remember that we are plotting the non-standardized values of sepal length and width for clearer interpretability.
Otherwise, interpreting the standardized scales on each axis would be confusing.
clusters_2 = x.copy()
clusters_2['cluster_pred'] = kmeans_2.fit_predict(x_scaled)
plt.scatter(clusters_2['sepal_length'], clusters_2['sepal_width'], c=clusters_2['cluster_pred'], cmap='rainbow')
plt.xlabel('Length of sepal')
plt.ylabel('Width of sepal')
plt.show()
Redo the same for 3 and 5 clusters.
kmeans_3 = KMeans(3)
kmeans_3.fit(x_scaled)
clusters_3 = x.copy()
clusters_3['cluster_pred'] = kmeans_3.fit_predict(x_scaled)
plt.scatter(clusters_3['sepal_length'], clusters_3['sepal_width'], c=clusters_3['cluster_pred'], cmap='rainbow')
plt.xlabel('Length of sepal')
plt.ylabel('Width of sepal')
plt.show()
kmeans_5 = KMeans(5)
kmeans_5.fit(x_scaled)
clusters_5 = x.copy()
clusters_5['cluster_pred'] = kmeans_5.fit_predict(x_scaled)
plt.scatter(clusters_5['sepal_length'], clusters_5['sepal_width'], c=clusters_5['cluster_pred'], cmap='rainbow')
plt.xlabel('Length of sepal')
plt.ylabel('Width of sepal')
plt.show()
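As a complement to the elbow curve, the silhouette score from sklearn.metrics rates how well-separated a clustering is (higher is better). A minimal sketch comparing our three candidate solutions:
from sklearn.metrics import silhouette_score
for k in (2, 3, 5):
    labels = KMeans(k).fit_predict(x_scaled)
    print(k, 'clusters: silhouette =', round(silhouette_score(x_scaled, labels), 3))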
The original (full) iris data is located in iris_with_answers.csv.
The 2-cluster solution seemed good, but in real life the iris dataset has 3 SPECIES (a 3-cluster solution). Therefore, clustering cannot be trusted blindly: sometimes a certain number of clusters seems like a good solution, but in real life there are more (or fewer).
real_data = pd.read_csv('iris-with-answers.csv')
real_data['species'].unique()
# We use the map function to encode the species names as numbers: 'setosa' -> 0, 'versicolor' -> 1, 'virginica' -> 2.
real_data['species'] = real_data['species'].map({'setosa':0, 'versicolor':1 , 'virginica':2})
real_data.head()
Looking at the first graph, the true species are much more intertwined than we imagined (and than the clusters we found before).
plt.scatter(real_data['sepal_length'], real_data['sepal_width'], c=real_data['species'], cmap='rainbow')
plt.show()
Examining the other scatter plot (petal length vs petal width), we see that the features which actually make the species different are the petals, NOT the sepals!
Note that the 'real data' is the data observed in the real world (biological data).
plt.scatter(real_data['petal_length'], real_data['petal_width'], c=real_data['species'], cmap='rainbow')
plt.show()
It seems that our solution takes into account mainly the sepal features
plt.scatter(clusters_3['sepal_length'], clusters_3['sepal_width'], c=clusters_3['cluster_pred'], cmap='rainbow')
plt.show()
Instead of the petals...
plt.scatter(clusters_3['petal_length'], clusters_3['petal_width'], c=clusters_3['cluster_pred'], cmap='rainbow')
plt.show()
Since the actual number of clusters is 3, we can see that the standardized 3-cluster solution comes closest to the true species, and that it is the petal measurements, not the sepal ones, that actually separate them.
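We can make this concrete by cross-tabulating the 3-cluster predictions against the true species. The cluster IDs (0, 1, 2) are arbitrary, so what matters is how cleanly each predicted cluster lines up with a single species column:
pd.crosstab(clusters_3['cluster_pred'], real_data['species'])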