Kyphosis Recurrence Predictor

Compare and Analyse various Classifier Algorithms

Problem Statement:

Given a dataset of 81 patients who underwent spinal surgery for a deformation, along with a record of whether the condition recurred, build a classification model to predict whether a patient being admitted for the surgery is at risk of recurrence. This model will help surgeons plan an appropriate level of treatment to prevent recurrence.

Data Set Description :

The kyphosis data frame has 81 rows and 4 columns, representing data on children who have had corrective spinal surgery.

This data frame contains the following columns/Features:

  1. Kyphosis: a factor with levels absent/present, indicating whether kyphosis (a type of deformation) was present after the operation.

  2. Age: age in months.

  3. Number: the number of vertebrae involved.

  4. Start: the number of the first (topmost) vertebra operated on.

Algorithms Suggested:

  1. Decision Trees
  2. Random Forest
  3. KNN Classifier
  4. Logistic Regression

Let us compare the accuracy and suggest the best classifier.

Import Libraries

In [24]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

Get the Data

In [25]:
df = pd.read_csv('kyphosis.csv')
In [26]:
df.head()
Out[26]:
Kyphosis Age Number Start
0 absent 71 3 5
1 absent 158 3 14
2 present 128 4 5
3 absent 2 5 1
4 absent 1 4 15
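
Before plotting, it is worth a quick sanity check on types, missing values, and class balance. A minimal sketch, using the same df as above:

# Inspect column types and check for missing values
df.info()

# Summary statistics for the numeric features
print(df.describe())

# How many 'absent' vs 'present' cases do we have?
print(df['Kyphosis'].value_counts())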

EDA

We'll just check out a simple pairplot for this small dataset.

In [27]:
sns.pairplot(df,hue='Kyphosis',palette='Set1')
Out[27]:
<seaborn.axisgrid.PairGrid at 0xb9d2978>
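
Since recurrence is the rare outcome here, a quick countplot makes the class imbalance visible; a small sketch using the seaborn import above:

# Visualize the class imbalance: 'present' cases are a small minority
sns.countplot(x='Kyphosis', data=df)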

Train Test Split

Let's split up the data into a training set and a test set!

In [47]:
from sklearn.model_selection import train_test_split
In [48]:
X = df.drop('Kyphosis',axis=1)
y = df['Kyphosis']
In [56]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30,random_state=69)
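
With only a handful of 'present' cases in 81 rows, a plain 30% split can leave very few positives in the test set (here, just 3). One option, shown as a sketch below with illustrative _s names, is to stratify the split so both sets keep the class ratio; the sections that follow use the unstratified split above.

# Stratified variant: keeps the absent/present ratio equal in both splits
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.30, random_state=69, stratify=y)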

DECISION TREES

We'll start just by training a single decision tree.

In [57]:
from sklearn.tree import DecisionTreeClassifier
In [58]:
dtree = DecisionTreeClassifier()
In [59]:
dtree.fit(X_train,y_train)
Out[59]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
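
To inspect what the tree learned, newer scikit-learn releases (0.21+) provide sklearn.tree.plot_tree; the printed repr above suggests an older version, so treat this as a sketch assuming an upgraded install:

from sklearn.tree import plot_tree

# Draw the fitted tree with feature and class names filled in
plt.figure(figsize=(12, 6))
plot_tree(dtree, feature_names=list(X.columns),
          class_names=list(dtree.classes_), filled=True)
plt.show()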

Prediction and Evaluation

Let's evaluate our decision tree.

In [60]:
predictions = dtree.predict(X_test)
In [61]:
from sklearn.metrics import classification_report,confusion_matrix
In [62]:
print(classification_report(y_test,predictions))
             precision    recall  f1-score   support

     absent       0.88      0.95      0.91        22
    present       0.00      0.00      0.00         3

avg / total       0.77      0.84      0.80        25

In [63]:
print(confusion_matrix(y_test,predictions))
[[21  1]
 [ 3  0]]
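
The confusion matrix shows the tree never predicts 'present' correctly here (0 of 3), so the headline score is driven almost entirely by the majority class. For an explicit accuracy number:

from sklearn.metrics import accuracy_score

# Overall accuracy: (21 + 0) / 25 = 0.84 on this split
print(accuracy_score(y_test, predictions))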

RANDOM FOREST

Now let's compare the decision tree model to a random forest.

In [64]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train, y_train)
Out[64]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
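
As a side benefit, the fitted forest exposes feature_importances_, a rough ranking of which inputs drive its splits:

# Importance scores sum to 1 across the three features
for name, importance in zip(X.columns, rfc.feature_importances_):
    print(name, round(importance, 3))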

Let's evaluate the Random Forest Classifier

In [65]:
rfc_pred = rfc.predict(X_test)
In [66]:
print(confusion_matrix(y_test,rfc_pred))
[[21  1]
 [ 3  0]]
In [67]:
print(classification_report(y_test,rfc_pred))
             precision    recall  f1-score   support

     absent       0.88      0.95      0.91        22
    present       0.00      0.00      0.00         3

avg / total       0.77      0.84      0.80        25

LOGISTIC REGRESSION

In [68]:
from sklearn.linear_model import LogisticRegression
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)
Out[68]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
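
Since scikit-learn orders classes alphabetically, 'present' is the positive class here, so a positive coefficient pushes a prediction toward recurrence. A quick look at the fitted weights:

# One weight per feature; positive values push toward 'present'
for name, coef in zip(X.columns, logmodel.coef_[0]):
    print(name, round(coef, 3))
print('intercept:', logmodel.intercept_[0])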

Let's evaluate the Logistic Regression Classifier

In [69]:
predictions = logmodel.predict(X_test)

from sklearn.metrics import classification_report
print(classification_report(y_test,predictions))
             precision    recall  f1-score   support

     absent       0.91      0.95      0.93        22
    present       0.50      0.33      0.40         3

avg / total       0.86      0.88      0.87        25

KNN CLASSIFIER MODEL

KNN needs the data to be scaled before we build the model.

In [70]:
from sklearn.preprocessing import StandardScaler
# Standardize the data to a common scale
scaler = StandardScaler()
scaler.fit(df.drop('Kyphosis', axis=1))
scaled_features = scaler.transform(df.drop('Kyphosis', axis=1))
df_feat = pd.DataFrame(scaled_features, columns=df.columns[1:])
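
One caveat worth noting: the scaler above is fit on all 81 rows before the split, so test-set statistics leak into training. A cleaner sketch (illustrative names; the cells below keep the notebook's simpler whole-dataset scaling) fits the scaler on the training rows only:

# Split first, then fit the scaler on the training portion only
X_tr_raw, X_te_raw, y_tr, y_te = train_test_split(
    df.drop('Kyphosis', axis=1), df['Kyphosis'],
    test_size=0.3, random_state=101)
scaler = StandardScaler().fit(X_tr_raw)
X_tr = scaler.transform(X_tr_raw)
X_te = scaler.transform(X_te_raw)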
In [71]:
from sklearn.model_selection import train_test_split
X= df_feat
y=df['Kyphosis']
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=101)

# Elbow method to determine the optimal K value
from sklearn.neighbors import KNeighborsClassifier

error_rate = []
for i in range(1, 20):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))

plt.plot(range(1, 20), error_rate)
Out[71]:
[<matplotlib.lines.Line2D at 0xc24e128>]
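
Because the elbow curve is judged on a single split, the chosen K can be noisy. An alternative sketch scores each candidate K with 5-fold cross-validation instead:

from sklearn.model_selection import cross_val_score

# Mean 5-fold CV accuracy per candidate K on the scaled features
for k in range(1, 20):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(k, round(scores.mean(), 3))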
In [72]:
# Build the model with K = 3
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Predict and evaluate the model
pred = knn.predict(X_test)
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))
[[16  1]
 [ 5  3]]
             precision    recall  f1-score   support

     absent       0.76      0.94      0.84        17
    present       0.75      0.38      0.50         8

avg / total       0.76      0.76      0.73        25

Conclusion

The test-set accuracy of each model, read off the confusion matrices and classification reports above, is:

1. Decision Tree Classifier - 84%
2. Random Forest Classifier - 84%
3. Logistic Regression Classifier - 88%
4. KNN Classifier - 76%

(Note that the KNN model was evaluated on a different, scaled split, so its figure is not strictly comparable to the other three. Note also that every model struggles to recall the minority 'present' class, so accuracy alone overstates clinical usefulness.)

Out of the four, the Logistic Regression classifier gives the best accuracy at 88%. So for the Kyphosis Recurrence Predictor we are building, we suggest the Logistic Regression classification algorithm.