Given a dataset of 81 patients who underwent spinal surgery for a deformity, along with whether the condition recurred, build a classification model to predict whether a patient being admitted for the surgery is likely to have a recurrence. Such a model can help surgeons plan an appropriate level of treatment to prevent recurrence.
The kyphosis data frame has 81 rows and 4 columns, representing data on children who have had corrective spinal surgery.
This data frame contains the following columns (features):
Kyphosis: a factor with levels absent and present, indicating whether a kyphosis (a type of deformation) was present after the operation.
Age: the patient's age in months.
Number: the number of vertebrae involved.
Start: the number of the first (topmost) vertebra operated on.
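A quick sanity check of this schema before modelling might look like the sketch below. The helper name `summarize_kyphosis` is hypothetical; it assumes the four column names listed above and only reports the shape, the number of missing values, and the class counts.

```python
import pandas as pd

def summarize_kyphosis(df: pd.DataFrame) -> dict:
    """Return basic facts about a kyphosis-style frame (hypothetical helper)."""
    expected = ["Kyphosis", "Age", "Number", "Start"]
    missing_cols = [c for c in expected if c not in df.columns]
    if missing_cols:
        raise ValueError(f"missing columns: {missing_cols}")
    return {
        "shape": df.shape,                                    # (rows, columns)
        "n_missing": int(df[expected].isna().sum().sum()),    # total NaN cells
        "class_counts": df["Kyphosis"].value_counts().to_dict(),
    }
```

Once the CSV has been loaded, `summarize_kyphosis(df)` should report a shape of `(81, 4)` if the file matches the description above; the class counts show how imbalanced the target is.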
We will train four classifiers (decision tree, random forest, logistic regression, and KNN), compare their accuracy, and suggest the best one.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('kyphosis.csv')
df.head()
We'll just check out a simple pairplot for this small dataset.
sns.pairplot(df,hue='Kyphosis',palette='Set1')
Let's split up the data into a training set and a test set!
from sklearn.model_selection import train_test_split
X = df.drop('Kyphosis',axis=1)
y = df['Kyphosis']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30,random_state=69)
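Because the dataset is small, a plain random split can leave very few `present` cases in the test set. One option, not used in this notebook, is to pass `stratify=y` so that both parts keep roughly the same class proportions. The helper name below is hypothetical; the defaults mirror the split above.

```python
from sklearn.model_selection import train_test_split

def stratified_split(X, y, test_size=0.30, random_state=69):
    """Split X and y while preserving the class ratio of y in both parts."""
    return train_test_split(
        X, y,
        test_size=test_size,
        random_state=random_state,
        stratify=y,  # assumption: class balance should be kept in both parts
    )
```

With 81 rows and `test_size=0.30`, scikit-learn rounds the test set up to 25 rows and leaves 56 for training, so each misclassified test patient moves accuracy by four percentage points.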
We'll start just by training a single decision tree.
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier()
dtree.fit(X_train,y_train)
Let's evaluate our decision tree.
predictions = dtree.predict(X_test)
from sklearn.metrics import classification_report,confusion_matrix
print(classification_report(y_test,predictions))
print(confusion_matrix(y_test,predictions))
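To make the report easier to read, the sketch below derives accuracy, precision, and recall for the `present` class directly from a 2x2 confusion matrix. The counts are made-up illustration values, not this notebook's results. The layout assumes scikit-learn's convention: rows are true labels, columns are predicted labels, and labels are sorted alphabetically (`absent`, then `present`).

```python
def metrics_from_confusion(cm):
    """cm = [[TN, FP], [FN, TP]] with 'present' treated as the positive class."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Hypothetical counts for a 25-patient test set (illustration only).
example = metrics_from_confusion([[17, 3], [3, 2]])
print(example)
```

In this made-up case, accuracy is 19/25 = 0.76 while recall for `present` is only 2/5 = 0.4, which shows how overall accuracy can hide weak performance on the minority class.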
Now let's compare the decision tree model to a random forest.
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train, y_train)
Let's evaluate the Random Forest Classifier
rfc_pred = rfc.predict(X_test)
print(confusion_matrix(y_test,rfc_pred))
print(classification_report(y_test,rfc_pred))
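A fitted random forest also exposes `feature_importances_`, which can hint at which of `Age`, `Number`, and `Start` drive the predictions. The helper below is a hypothetical sketch; it assumes the training features are a pandas DataFrame so that column names are available, and it fixes `random_state` only to make the result repeatable.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def ranked_importances(X_train: pd.DataFrame, y_train, random_state=0) -> pd.Series:
    """Fit a forest and return impurity-based importances, largest first."""
    forest = RandomForestClassifier(n_estimators=100, random_state=random_state)
    forest.fit(X_train, y_train)
    importances = pd.Series(forest.feature_importances_, index=X_train.columns)
    return importances.sort_values(ascending=False)
```

Calling `ranked_importances(X_train, y_train)` returns a Series whose values sum to 1. With so few rows, the ranking should be read as a rough indication rather than a firm conclusion.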
Next, let's train a logistic regression model.
from sklearn.linear_model import LogisticRegression
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)
Let's evaluate the Logistic Regression Classifier
predictions = logmodel.predict(X_test)
from sklearn.metrics import classification_report
print(classification_report(y_test,predictions))
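Since the stated goal is to estimate a patient's chance of recurrence, the predicted probability may be more useful to a surgeon than the hard label. A minimal sketch, assuming a fitted scikit-learn classifier that supports `predict_proba` and has `present` among its classes; the helper name is hypothetical.

```python
def recurrence_probability(model, X):
    """Return P(Kyphosis == 'present') for each row of X."""
    classes = list(model.classes_)
    present_col = classes.index("present")  # column order follows model.classes_
    return model.predict_proba(X)[:, present_col]
```

For example, `recurrence_probability(logmodel, X_test)` would give one value between 0 and 1 per test patient, which could then be compared against a threshold other than the default 0.5.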
KNN is distance-based, so the features need to be scaled before we build the model.
from sklearn.preprocessing import StandardScaler
# Standardize the features to zero mean and unit variance
features = df.drop('Kyphosis', axis=1)
scaler = StandardScaler()
scaler.fit(features)
scaled_features = scaler.transform(features)
df_feat = pd.DataFrame(scaled_features, columns=features.columns)
from sklearn.model_selection import train_test_split
X= df_feat
y=df['Kyphosis']
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=101)
# Elbow method to determine the optimal K value
from sklearn.neighbors import KNeighborsClassifier
error_rate = []
for i in range(1, 20):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    # Fraction of misclassified test samples for this K
    error_rate.append(np.mean(pred_i != y_test))
plt.plot(range(1, 20), error_rate)
#Build the Model with 3 as K
knn=KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train,y_train)
# Predict and Evaluate the Model
pred = knn.predict(X_test)
print(confusion_matrix(y_test,pred))
print(classification_report(y_test,pred))
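In the KNN steps above, the scaler is fitted on all 81 rows before the split, and K is chosen by its error on the test set itself, so the reported score may be optimistic. One alternative, sketched here under the assumption that unscaled training features are passed in, is to wrap scaling and KNN in a pipeline and pick K by cross-validation on the training data only. The helper name and the choice of 5 folds are assumptions.

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def best_k_by_cv(X_train, y_train, k_values=range(1, 20), cv=5):
    """Return (best_k, mean CV accuracy per k) using training data only."""
    scores = {}
    for k in k_values:
        # The scaler is refitted inside each fold, so no held-out rows leak in.
        pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
        scores[k] = cross_val_score(
            pipe, X_train, y_train, cv=cv, scoring="accuracy"
        ).mean()
    best_k = max(scores, key=scores.get)
    return best_k, scores
```

The selected K can then be refitted on the full training set and evaluated once on the untouched test set.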
The accuracy levels of the models were observed to be:
1. Decision Tree Classifier - 77%
2. Random Forest Classifier - 77%
3. Logistic Regression Classifier - 86%
4. KNN Classifier - 76%
Of the four, the Logistic Regression classifier gives the highest accuracy at 86%. So, for the Kyphosis recurrence predictor we are building, we suggest the logistic regression classification algorithm.
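The four figures above come from single splits of a very small dataset, and the KNN split used a different `random_state` from the other three. A cross-validated comparison on identical folds would give a steadier ranking. The sketch below is one hypothetical way to set that up; the helper name, the 5 folds, the fixed seeds, and `max_iter=1000` are all assumptions rather than settings from this notebook.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

def compare_classifiers(X, y, n_splits=5, seed=0):
    """Return mean cross-validated accuracy per model, using the same folds for all."""
    models = {
        "Decision Tree": DecisionTreeClassifier(random_state=seed),
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=seed),
        "Logistic Regression": LogisticRegression(max_iter=1000),
        # KNN is distance-based, so scaling happens inside the pipeline per fold.
        "KNN (K=3)": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3)),
    }
    # A fixed random_state makes every model see identical train/validation folds.
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return {
        name: cross_val_score(model, X, y, cv=folds, scoring="accuracy").mean()
        for name, model in models.items()
    }
```

Calling `compare_classifiers(df.drop('Kyphosis', axis=1), df['Kyphosis'])` returns a dictionary of mean accuracies that can be compared directly, since every model is scored on the same folds.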