A fleet management company plans to provide safe and efficient driving training to its drivers. It also wants to reward drivers based on their performance. The company wants to group its drivers into segments based on their speeding behaviour and the distance they drive per day, and wants you to build a segmentation (clustering) model that assigns each driver to a group.
You have been provided with the necessary data as a .csv file.
import pandas as pd
from sklearn.cluster import KMeans
import warnings
warnings.simplefilter('ignore')
df = pd.read_csv('C:\\Users\\Mohamed Kani\\Desktop\\Data_Science\\KMeans\\Python\\Kmeans.csv')
df.head()
df.describe()
import seaborn as sns
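# distplot is deprecated since seaborn 0.11; histplot with a KDE overlay gives the equivalent view
sns.histplot(df['Distance_Feature'], kde=True, stat='density')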
There are two groups of drivers here: one covers about 50 miles per day on average, the other about 160 miles per day.
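# distplot is deprecated since seaborn 0.11; histplot with a KDE overlay gives the equivalent view
sns.histplot(df['Speeding_Feature'], kde=True, stat='density')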
Most drivers drive at a speed of around 10 miles/hour, but a few drive at speeds as high as 100 miles/hour.
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.jointplot(x="Distance_Feature", y="Speeding_Feature", data=df)
Let us try to form two segments from the dataset using the KMeans algorithm.
# Note: no train/test split is needed, as this is an unsupervised learning algorithm
kmeans = KMeans(n_clusters=2)
kmeans.fit(df.drop('Driver_ID',axis=1))
The command below returns the centroids of the clusters. Each centroid is the mean of the features over the drivers in that cluster, so it represents the group as a whole.
kmeans.cluster_centers_
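The raw centroid array is easier to read when its columns carry the feature names. The sketch below shows one way to do that. It runs on a small synthetic stand-in for the driver data (the values are made up for illustration and are not taken from Kmeans.csv), and it fixes `n_init` and `random_state`, which the notebook itself does not set.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the driver data (illustrative values only).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Distance_Feature": np.concatenate([rng.normal(50, 5, 100), rng.normal(180, 10, 100)]),
    "Speeding_Feature": np.concatenate([rng.normal(10, 3, 100), rng.normal(20, 5, 100)]),
})

# random_state is an assumption added here so that repeated runs agree.
demo_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(demo)

# One row per cluster, one column per feature.
centroids = pd.DataFrame(demo_kmeans.cluster_centers_, columns=demo.columns)
print(centroids.round(1))
```

Each row can then be read as the "typical" driver of that cluster, for example a short-distance group and a long-distance group.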
Now let us add the cluster labels to the data frame in a new column, label_2.
label=kmeans.labels_
df['label_2']=label
set(df['label_2'])
The output above shows that two groups have been formed from the given dataset.
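A set only confirms which labels exist. To see how many drivers fall into each group and what each group looks like on average, `value_counts` and `groupby` can be used. This is a minimal sketch on synthetic data (made-up values, not the real file); the `features` list and the fixed `random_state` are assumptions introduced for the example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the driver data (illustrative values only).
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "Distance_Feature": np.concatenate([rng.normal(50, 5, 100), rng.normal(180, 10, 100)]),
    "Speeding_Feature": np.concatenate([rng.normal(10, 3, 100), rng.normal(20, 5, 100)]),
})
features = ["Distance_Feature", "Speeding_Feature"]

# Fit on the feature columns only and store the labels next to the data.
demo["label_2"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(demo[features])

# Number of drivers assigned to each cluster.
sizes = demo["label_2"].value_counts().sort_index()
print(sizes)

# Mean of each feature within each cluster.
profile = demo.groupby("label_2")[features].mean()
print(profile.round(1))
```

Note that KMeans label numbers are arbitrary: cluster 0 in one run may be cluster 1 in another unless the random seed is fixed.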
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.lmplot(x='Distance_Feature', y='Speeding_Feature', data=df, fit_reg=False, hue="label_2")
Suppose the client asks for four groups instead.
Let us try to form four segments from the dataset using the KMeans algorithm.
kmeans = KMeans(n_clusters=4)
kmeans.fit(df.drop(['Driver_ID','label_2'],axis=1))
kmeans.cluster_centers_
label_4=kmeans.labels_
set(label_4)
The output above shows that the data is now split into four segments.
df['label_4']=label_4
df.head()
Let us visualise the four clusters now.
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.lmplot(x='Distance_Feature', y='Speeding_Feature', data=df, fit_reg=False, hue="label_4")
sse = {}
# Loop over candidate cluster counts and capture the inertia of each fit
for k in range(1, 20):
    kmeans = KMeans(n_clusters=k).fit(df.drop(['Driver_ID','label_2','label_4'],axis=1))
    sse[k] = kmeans.inertia_
#Store the no. of groups and the error as separate lists
groups=list(sse.keys())
error=list(sse.values())
#Club the lists as a dataframe
error_df=pd.DataFrame(list(zip(groups, error)),columns=['groups','error'])
error_df.head()
sns.pointplot(x="groups", y="error", data=error_df)
From the graph above, two groups already look like an efficient clustering of the dataset.
However, the error (the inertia, i.e. the within-cluster sum of squared distances) keeps decreasing up to about six groups, after which it stays roughly constant.
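To make the "error" in the graph concrete, the sketch below recomputes the inertia by hand as the sum of squared distances from each point to its closest centroid, and then prints the relative drop in inertia for each additional cluster. It uses synthetic two-blob data (made-up values, not the driver file); `manual_inertia` is a hypothetical helper written for this example, not part of scikit-learn.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic two-blob data (illustrative values only).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([50, 10], [5, 3], size=(100, 2)),
    rng.normal([180, 20], [10, 5], size=(100, 2)),
])

def manual_inertia(points, centers):
    """Sum of squared distances from each point to its closest center."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_
    # The hand-computed value should agree with scikit-learn's inertia_.
    print(k, round(model.inertia_, 1), round(manual_inertia(X, model.cluster_centers_), 1))

# Relative drop in inertia when going from k-1 to k clusters.
drops = {k: 1 - inertias[k] / inertias[k - 1] for k in range(2, 7)}
for k, drop in drops.items():
    print(f"k={k}: inertia falls by {drop:.1%}")
```

On data like this, the "elbow" is the value of k after which the relative drops become small. Inertia almost always keeps falling as k grows, so a lower value alone does not justify more clusters.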
Since we have already seen how four groups are formed, let us try five and six segments.
kmeans = KMeans(n_clusters=5)
kmeans.fit(df.drop(['Driver_ID','label_2','label_4'],axis=1))
df['label_5']=kmeans.labels_
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.lmplot(x='Distance_Feature', y='Speeding_Feature', data=df, fit_reg=False, hue="label_5")
kmeans = KMeans(n_clusters=6)
kmeans.fit(df.drop(['Driver_ID','label_2','label_4','label_5'],axis=1))
df['label_6']=kmeans.labels_
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.lmplot(x='Distance_Feature', y='Speeding_Feature', data=df, fit_reg=False, hue="label_6")
Hence, using the K-Means algorithm, the given set of 4000 drivers has been grouped into 2, 4, 5 and 6 clusters. Once the client confirms the number of groups to use, we shall freeze the model.
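One possible way to "freeze" the chosen model is to save the fitted estimator to disk and reload it when new drivers need to be assigned to a segment. The sketch below assumes `joblib` for persistence and uses synthetic data; the file name `driver_segments.joblib`, the two-cluster choice and the new driver values are all hypothetical placeholders.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the driver data (illustrative values only).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Distance_Feature": np.concatenate([rng.normal(50, 5, 100), rng.normal(180, 10, 100)]),
    "Speeding_Feature": np.concatenate([rng.normal(10, 3, 100), rng.normal(20, 5, 100)]),
})

# Fit the agreed number of clusters; 2 is only an example here.
final_model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(demo)

# Save the fitted model (hypothetical file name, written to a temp directory).
model_path = os.path.join(tempfile.mkdtemp(), "driver_segments.joblib")
joblib.dump(final_model, model_path)

# Later: reload the frozen model and assign new drivers to existing segments.
frozen = joblib.load(model_path)
new_drivers = pd.DataFrame({
    "Distance_Feature": [45.0, 190.0],
    "Speeding_Feature": [8.0, 25.0],
})
new_labels = frozen.predict(new_drivers)
print(new_labels)
```

The reloaded model does not refit: `predict` only assigns each new driver to the nearest stored centroid, so the segments stay stable until the model is deliberately retrained.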