Kyphosis Recurrence Predictor

Compare and Analyse various Classifier Algorithms

Problem Statement:

Given a dataset of 81 patients who underwent spinal surgery for a deformation, along with a record of whether the condition recurred, build a classification model to predict whether a patient being admitted for the surgery is at risk of recurrence. This model will help surgeons plan an appropriate level of treatment to prevent recurrence.

Data Set Description :

The kyphosis data frame has 81 rows and 4 columns, representing data on children who have had corrective spinal surgery.

This data frame contains the following columns/Features:

  1. Kyphosis: a factor with levels absent/present, indicating whether kyphosis (a type of deformation) was present after the operation.

  2. Age: age in months.

  3. Number: the number of vertebrae involved.

  4. Start: the number of the first (topmost) vertebra operated on.

Algorithms Suggested:

  1. Decision Trees
  2. Random Forest
  3. KNN Classifier
  4. Logistic Regression

Let us compare the accuracy and suggest the best classifier.

Import Libraries

In [24]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

Get the Data

In [25]:
df = pd.read_csv('kyphosis.csv')
In [26]:
df.head()
Out[26]:
Kyphosis Age Number Start
0 absent 71 3 5
1 absent 158 3 14
2 present 128 4 5
3 absent 2 5 1
4 absent 1 4 15
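
Before plotting, it is worth a quick sanity check on types, missing values, and class balance. A minimal sketch, using the same df as above:

# Inspect column types and check for missing values
df.info()

# Summary statistics for the numeric features
print(df.describe())

# How many 'absent' vs 'present' cases do we have?
print(df['Kyphosis'].value_counts())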

EDA

We'll just check out a simple pairplot for this small dataset.

In [27]:
sns.pairplot(df,hue='Kyphosis',palette='Set1')
Out[27]:
<seaborn.axisgrid.PairGrid at 0xb9d2978>
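
Since recurrence is the rare outcome here, a quick countplot makes the class imbalance visible; a small sketch using the seaborn import above:

# Visualize the class imbalance: 'present' cases are a small minority
sns.countplot(x='Kyphosis', data=df)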

Train Test Split

Let's split up the data into a training set and a test set!

In [47]:
from sklearn.model_selection import train_test_split
In [48]:
X = df.drop('Kyphosis',axis=1)
y = df['Kyphosis']
In [56]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30,random_state=69)
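
With only a handful of 'present' cases in 81 rows, a plain 30% split can leave very few positives in the test set (here, just 3). One option, shown as a sketch below with illustrative _s names, is to stratify the split so both sets keep the class ratio; the sections that follow use the unstratified split above.

# Stratified variant: keeps the absent/present ratio equal in both splits
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.30, random_state=69, stratify=y)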

DECISION TREES

We'll start just by training a single decision tree.

In [57]:
from sklearn.tree import DecisionTreeClassifier
In [58]:
dtree = DecisionTreeClassifier()
In [59]:
dtree.fit(X_train,y_train)
Out[59]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
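
To inspect what the tree learned, newer scikit-learn releases (0.21+) provide sklearn.tree.plot_tree; the printed repr above suggests an older version, so treat this as a sketch assuming an upgraded install:

from sklearn.tree import plot_tree

# Draw the fitted tree with feature and class names filled in
plt.figure(figsize=(12, 6))
plot_tree(dtree, feature_names=list(X.columns),
          class_names=list(dtree.classes_), filled=True)
plt.show()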

Prediction and Evaluation

Let's evaluate our decision tree.

In [60]:
predictions = dtree.predict(X_test)
In [61]:
from sklearn.metrics import classification_report,confusion_matrix
In [62]:
print(classification_report(y_test,predictions))
             precision    recall  f1-score   support

     absent       0.88      0.95      0.91        22
    present       0.00      0.00      0.00         3

avg / total       0.77      0.84      0.80        25

In [63]:
print(confusion_matrix(y_test,predictions))
[[21  1]
 [ 3  0]]
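
The confusion matrix shows the tree never predicts 'present' correctly here (0 of 3), so the headline score is driven almost entirely by the majority class. For an explicit accuracy number:

from sklearn.metrics import accuracy_score

# Overall accuracy: (21 + 0) / 25 = 0.84 on this split
print(accuracy_score(y_test, predictions))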

RANDOM FOREST

Now let's compare the decision tree model to a random forest.

In [64]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train, y_train)
Out[64]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
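
As a side benefit, the fitted forest exposes feature_importances_, a rough ranking of which inputs drive its splits:

# Importance scores sum to 1 across the three features
for name, importance in zip(X.columns, rfc.feature_importances_):
    print(name, round(importance, 3))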

Let's evaluate the Random Forest Classifier

In [65]:
rfc_pred = rfc.predict(X_test)
In [66]:
print(confusion_matrix(y_test,rfc_pred))
[[21  1]
 [ 3  0]]
In [67]:
print(classification_report(y_test,rfc_pred))
             precision    recall  f1-score   support

     absent       0.88      0.95      0.91        22
    present       0.00      0.00      0.00         3

avg / total       0.77      0.84      0.80        25

LOGISTIC REGRESSION

In [68]:
from sklearn.linear_model import LogisticRegression
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)
Out[68]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
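
Since scikit-learn orders classes alphabetically, 'present' is the positive class here, so a positive coefficient pushes a prediction toward recurrence. A quick look at the fitted weights:

# One weight per feature; positive values push toward 'present'
for name, coef in zip(X.columns, logmodel.coef_[0]):
    print(name, round(coef, 3))
print('intercept:', logmodel.intercept_[0])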

Let's evaluate the Logistic Regression Classifier

In [69]:
predictions = logmodel.predict(X_test)

from sklearn.metrics import classification_report
print(classification_report(y_test,predictions))
             precision    recall  f1-score   support

     absent       0.91      0.95      0.93        22
    present       0.50      0.33      0.40         3

avg / total       0.86      0.88      0.87        25

KNN CLASSIFIER MODEL

KNN needs the data to be scaled before we build the model.

In [70]:
from sklearn.preprocessing import StandardScaler
# Standardize the data to a common scale
scaler = StandardScaler()
scaler.fit(df.drop('Kyphosis', axis=1))
scaled_features = scaler.transform(df.drop('Kyphosis', axis=1))
df_feat = pd.DataFrame(scaled_features, columns=df.columns[1:])
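
One caveat worth noting: the scaler above is fit on all 81 rows before the split, so test-set statistics leak into training. A cleaner sketch (illustrative names; the cells below keep the notebook's simpler whole-dataset scaling) fits the scaler on the training rows only:

# Split first, then fit the scaler on the training portion only
X_tr_raw, X_te_raw, y_tr, y_te = train_test_split(
    df.drop('Kyphosis', axis=1), df['Kyphosis'],
    test_size=0.3, random_state=101)
scaler = StandardScaler().fit(X_tr_raw)
X_tr = scaler.transform(X_tr_raw)
X_te = scaler.transform(X_te_raw)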
In [71]:
from sklearn.model_selection import train_test_split
X= df_feat
y=df['Kyphosis']
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=101)

# Elbow method to determine the optimal K value
from sklearn.neighbors import KNeighborsClassifier

error_rate = []
for i in range(1, 20):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))

plt.plot(range(1, 20), error_rate)
Out[71]:
[<matplotlib.lines.Line2D at 0xc24e128>]
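
Because the elbow curve is judged on a single split, the chosen K can be noisy. An alternative sketch scores each candidate K with 5-fold cross-validation instead:

from sklearn.model_selection import cross_val_score

# Mean 5-fold CV accuracy per candidate K on the scaled features
for k in range(1, 20):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(k, round(scores.mean(), 3))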
In [72]:
# Build the model with K = 3
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Predict and evaluate the model
pred = knn.predict(X_test)
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))
[[16  1]
 [ 5  3]]
             precision    recall  f1-score   support

     absent       0.76      0.94      0.84        17
    present       0.75      0.38      0.50         8

avg / total       0.76      0.76      0.73        25

Conclusion

The test-set accuracy of each model, read off the confusion matrices and classification reports above, is:

1. Decision Tree Classifier - 84%
2. Random Forest Classifier - 84%
3. Logistic Regression Classifier - 88%
4. KNN Classifier - 76%

(Note that the KNN model was evaluated on a different, scaled split, so its figure is not strictly comparable to the other three. Note also that every model struggles to recall the minority 'present' class, so accuracy alone overstates clinical usefulness.)

Out of the four, the Logistic Regression classifier gives the best accuracy at 88%. So for the Kyphosis Recurrence Predictor we are building, we suggest the Logistic Regression classification algorithm.