Use Case:

Let us take the Titanic data set and build a model that predicts survival - whether or not a passenger would have survived the voyage on the Titanic. This can serve as a prototype for an accident-probability or air-crash-probability predictor and many other similar use cases.

Data Set:

The Titanic data set contains various data points about each passenger who was on board the Titanic. The data set also records whether or not the passenger survived the disaster.

Step 1: Import Data


df.train <- read.csv('C:\\Users\\uia94128\\Desktop\\MFoi\\Day4\\Titanic\\titanic_train.csv')
head(df.train)
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp
## 1                             Braund, Mr. Owen Harris   male  22     1
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1
## 3                              Heikkinen, Miss. Laina female  26     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1
## 5                            Allen, Mr. William Henry   male  35     0
## 6                                    Moran, Mr. James   male  NA     0
##   Parch           Ticket    Fare Cabin Embarked
## 1     0        A/5 21171  7.2500              S
## 2     0         PC 17599 71.2833   C85        C
## 3     0 STON/O2. 3101282  7.9250              S
## 4     0           113803 53.1000  C123        S
## 5     0           373450  8.0500              S
## 6     0           330877  8.4583              Q

Exploratory Data Analysis (EDA)

Let us explore how much data is missing. We can use the Amelia package for this (install it first if necessary).

#install.packages("Amelia")
library(Amelia)
## Loading required package: Rcpp
## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.5, built: 2018-05-07)
## ## Copyright (C) 2005-2018 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
missmap(df.train, main="Titanic Training Data - Missings Map", 
        col=c("yellow", "black"), legend=FALSE)

Roughly 20 percent of the Age values are missing. We shall come up with a way to fill in this missing data.
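If the Amelia package is not available, the missing values can also be counted per column with base R alone (a minimal sketch):

# Number of NA values in each column of the training data
sapply(df.train, function(x) sum(is.na(x)))

# Proportion of rows with a missing Age
mean(is.na(df.train$Age))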

Data Visualization

Survivors Histogram

library(ggplot2)
ggplot(df.train,aes(Survived)) + geom_bar()

Passengers Count Class wise Distribution

ggplot(df.train,aes(Pclass)) + geom_bar(aes(fill=factor(Pclass)),alpha=0.5)

Gender Distribution

ggplot(df.train,aes(Sex)) + geom_bar(aes(fill=factor(Sex)),alpha=0.5)

Data Cleaning

We want to fill in the missing age data instead of simply dropping those rows.

One way to do this is to fill in the mean age of all the passengers (simple mean imputation).
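A minimal sketch of that simple approach (not the one we use below; df.mean.imputed is just an illustrative copy so the original data frame stays untouched):

# Fill every missing Age with the overall mean age (simple mean imputation)
mean.age <- mean(df.train$Age, na.rm = TRUE)
df.mean.imputed <- df.train
df.mean.imputed$Age[is.na(df.mean.imputed$Age)] <- mean.age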

The other way is to check whether age is related to another column and fill in the age accordingly.

pl <- ggplot(df.train,aes(Pclass,Age)) + geom_boxplot(aes(group=Pclass,fill=factor(Pclass),alpha=0.4)) 
pl + scale_y_continuous(breaks = seq(min(0), max(80), by = 2))
## Warning: Removed 177 rows containing non-finite values (stat_boxplot).

We see that the typical age differs by passenger class, so we shall impute the missing Age values based on Pclass using these per-class values.
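The values 37, 29 and 24 used below are roughly the per-class median ages read off the boxplot; a quick sketch to check them directly is:

# Median Age per passenger class (missing values are dropped by the formula interface)
aggregate(Age ~ Pclass, data = df.train, FUN = median)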

# Replace each missing Age with a typical age for that passenger's class
impute_age <- function(age,class){
    out <- age
    for (i in 1:length(age)){
        
        if (is.na(age[i])){

            if (class[i] == 1){
                out[i] <- 37

            }else if (class[i] == 2){
                out[i] <- 29

            }else{
                out[i] <- 24
            }
        }else{
            out[i]<-age[i]
        }
    }
    return(out)
}
fixed.ages <- impute_age(df.train$Age,df.train$Pclass)
df.train$Age <- fixed.ages
missmap(df.train, main="Titanic Training Data - Missings Map", 
        col=c("yellow", "black"), legend=FALSE)

Now there are no missing values in the data set.

Rather than using all the columns to build the model, let us use specific columns as input signals for the model.

df.train <- df.train[,!(colnames(df.train) %in% c('PassengerId','Name','Ticket','Cabin'))]

Train and Test Data Split:

library(caTools)
set.seed(101)

split = sample.split(df.train$Survived, SplitRatio = 0.70)

final.train = subset(df.train, split == TRUE)
final.test = subset(df.train, split == FALSE)
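A quick sanity check that the split is roughly 70/30:

# Row counts of the training and test subsets
nrow(final.train)
nrow(final.test)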

Logistic Regression Classifier

Let us build the first classifier model using the Logistic Regression algorithm. The function for building a Logistic Regression model is glm(). Three parameters are passed: the model formula, the family (binomial with a logit link), and the training data.

Building the Model:

log.model <- glm(formula=Survived ~ . , family = binomial(link='logit'),data = final.train)
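To inspect the fitted coefficients and their significance, summary() can be called on the model (output omitted here):

# Coefficients, standard errors and p-values of the logistic regression
summary(log.model)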

Prediction for test data:

fitted.probabilities <- predict(log.model,newdata=final.test,type='response')

Accuracy Measurement:

fitted.results <- ifelse(fitted.probabilities > 0.5,1,0)
misClasificError <- mean(fitted.results != final.test$Survived)
print(paste('Accuracy',1-misClasificError))
## [1] "Accuracy 0.783582089552239"

The Logistic Regression classifier model we have built is 78.4 % accurate in predicting whether a new passenger would survive the journey on the Titanic.
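A confusion matrix gives a more detailed view of these predictions (a minimal sketch, reusing the thresholded fitted.results from above):

# Rows: predicted class, Columns: actual Survived value
table(fitted.results, final.test$Survived)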

Decision Tree Model

Let us now build the next classifier using the Decision Tree algorithm. Library: rpart

library(rpart)
tree <- rpart(Survived ~.,method='class',data = final.train)
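Optionally, the fitted tree can be visualized with the rpart.plot package (an extra step, not required for the rest of the workflow; install the package first if needed):

#install.packages("rpart.plot")
library(rpart.plot)
prp(tree)   # plot the decision tree structure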

Test the Model: Use predict() to predict the Survived value on the test data.

tree.preds <- predict(tree,final.test)

Check the Head of the predicted values. You should notice that you actually have two columns with the probabilities.

head(tree.preds)
##             0         1
## 4  0.07826087 0.9217391
## 6  0.83547558 0.1645244
## 12 0.07826087 0.9217391
## 21 0.83547558 0.1645244
## 22 0.83547558 0.1645244
## 24 0.83547558 0.1645244

Write a simple R function to convert these two columns into one column that matches the original “1/0” label of the Survived column.

tree.preds <- as.data.frame(tree.preds)
joiner <- function(x){
    if (x>=0.5){
        return('1')
    }else{
        return("0")
    }
}

Apply that function to each row of the prediction result:

tree.preds$Survival <- sapply(tree.preds$'1',joiner)
head(tree.preds)
##             0         1 Survival
## 4  0.07826087 0.9217391        1
## 6  0.83547558 0.1645244        0
## 12 0.07826087 0.9217391        1
## 21 0.83547558 0.1645244        0
## 22 0.83547558 0.1645244        0
## 24 0.83547558 0.1645244        0
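As an aside, the same thresholding can be done in one vectorized call instead of applying joiner() with sapply():

# Equivalent, vectorized version of the joiner() step
tree.preds$Survival <- ifelse(tree.preds$'1' >= 0.5, '1', '0')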

Calculate the accuracy by comparing the prediction result with the actual result in final.test:

misClasificError <- mean(tree.preds$Survival != final.test$Survived)
print(paste('Accuracy',1-misClasificError))
## [1] "Accuracy 0.813432835820896"

The Decision Tree classifier model we have built is 81.3 % accurate in predicting whether a new passenger would survive the journey.

Random Forest Classifier

Let us now try to build a Random Forest classifier. Library: randomForest

library(randomForest)

Build The Model using Train Data:

rf.model <- randomForest(Survived ~ . , data = final.train)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
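The warning appears because Survived is stored as a numeric 0/1 column, so randomForest() runs in regression mode and returns numeric predictions between 0 and 1. We keep that behaviour here and threshold the predictions below, but a sketch of the explicit classification alternative would be:

# Converting the response to a factor makes randomForest do classification directly
rf.class.model <- randomForest(factor(Survived) ~ . , data = final.train)
rf.class.preds <- predict(rf.class.model, final.test)   # factor with levels "0"/"1"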

Test the model using the test data and print the confusion matrix of predicted vs. actual values:

p <- predict(rf.model,final.test)

joiner <- function(x){
    if (x>=0.5){
        return('1')
    }else{
        return("0")
    }
}
p<- sapply(p,joiner)


table(p,final.test$Survived)
##    
## p     0   1
##   0 150  29
##   1  15  74

Calculate the accuracy:

misClasificError <- mean(p != final.test$Survived)
print(paste('Accuracy',1-misClasificError))
## [1] "Accuracy 0.835820895522388"

The Random Forest classifier model we have built is 83.6 % accurate in predicting whether a passenger would survive the voyage or not.

KNN Algorithm Classifier

Let us now build the classifier using KNN Algorithm.

Data needs to be scaled before we pass it on to build the KNN model.

# Standardize the dataset using the scale() R function
standardized.titanic <- scale(df.train[,!(colnames(df.train) %in% c('Sex','Survived','Embarked'))])
standardized.titanic <- data.frame(standardized.titanic)
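We can sanity-check the scaling, e.g. by confirming that a scaled column now has variance 1 while the original column does not:

# Variance of the scaled Age column (should be ~1) vs. the original Age column
var(standardized.titanic$Age)
var(df.train$Age)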

Add the Survived column to the standardized data frame:

standardized.titanic$Survived <- df.train$Survived
head(standardized.titanic)
##       Pclass        Age      SibSp      Parch       Fare Survived
## 1  0.8269128 -0.5335340  0.4325504 -0.4734077 -0.5021631        0
## 2 -1.5652278  0.6745117  0.4325504 -0.4734077  0.7864036        1
## 3  0.8269128 -0.2315226 -0.4742788 -0.4734077 -0.4885799        1
## 4 -1.5652278  0.4480031  0.4325504 -0.4734077  0.4204941        1
## 5  0.8269128  0.4480031 -0.4742788 -0.4734077 -0.4860644        0
## 6  0.8269128 -0.3825283 -0.4742788 -0.4734077 -0.4778481        0

Train & Test Data Split from the scaled data

library(caTools)
set.seed(101)

split = sample.split(standardized.titanic$Survived, SplitRatio = 0.70)

titanic.train = subset(standardized.titanic, split == TRUE)
titanic.test = subset(standardized.titanic, split == FALSE)

Build the Model. Function: knn()

library(class)
predicted.Survived <- knn(titanic.train[,!(colnames(titanic.train) %in% c('Survived'))],titanic.test[,!(colnames(titanic.test) %in% c('Survived'))],titanic.train$Survived,k=1)
head(predicted.Survived)
## [1] 0 0 0 1 0 1
## Levels: 0 1

Check The Accuracy of the Model:

misClasificError <- mean(predicted.Survived != titanic.test$Survived)
print(paste('Accuracy',1-misClasificError))
## [1] "Accuracy 0.645522388059701"

Now Let us try with k=3, considering 3 neighbors to predict the result:

library(class)
predicted.Survived <- knn(titanic.train[,!(colnames(titanic.train) %in% c('Survived'))],titanic.test[,!(colnames(titanic.test) %in% c('Survived'))],titanic.train$Survived,k=3)
misClasificError <- mean(predicted.Survived != titanic.test$Survived)
print(paste('Accuracy',1-misClasificError))
## [1] "Accuracy 0.671641791044776"

Now let us try with k=5, considering 5 neighbors to predict the result:

predicted.Survived <- knn(titanic.train[,!(colnames(titanic.train) %in% c('Survived'))],titanic.test[,!(colnames(titanic.test) %in% c('Survived'))],titanic.train$Survived,k=5)
misClasificError <- mean(predicted.Survived != titanic.test$Survived)
print(paste('Accuracy',1-misClasificError))
## [1] "Accuracy 0.705223880597015"

We observe that the accuracy keeps improving. Let us try to find the optimum value of K.

Code a for loop to build the model for K = 1 to 20 and record the error rate for each K:

predicted.Survived = NULL
error.rate = NULL

for(i in 1:20){
    set.seed(101)
    predicted.Survived <- knn(titanic.train[,!(colnames(titanic.train) %in% c('Survived'))],titanic.test[,!(colnames(titanic.test) %in% c('Survived'))],titanic.train$Survived,k=i)
    error.rate[i] <-mean(predicted.Survived != titanic.test$Survived)
}

print(error.rate)
##  [1] 0.3432836 0.3470149 0.3320896 0.3320896 0.2947761 0.3059701 0.2947761
##  [8] 0.3059701 0.3022388 0.2985075 0.2723881 0.2798507 0.2910448 0.2798507
## [15] 0.2835821 0.2723881 0.2761194 0.2835821 0.2835821 0.2761194

Elbow Method:

We can plot out the various error rates for the K values. We should see an “elbow” indicating that we don’t get a decrease in error rate for using a higher K. This is a good cut-off point:

To plot the data, we need a data frame with two columns - the K value and the error rate.

library(ggplot2)
k.values <- 1:20
error.df <- data.frame(error.rate,k.values)
error.df
##    error.rate k.values
## 1   0.3432836        1
## 2   0.3470149        2
## 3   0.3320896        3
## 4   0.3320896        4
## 5   0.2947761        5
## 6   0.3059701        6
## 7   0.2947761        7
## 8   0.3059701        8
## 9   0.3022388        9
## 10  0.2985075       10
## 11  0.2723881       11
## 12  0.2798507       12
## 13  0.2910448       13
## 14  0.2798507       14
## 15  0.2835821       15
## 16  0.2723881       16
## 17  0.2761194       17
## 18  0.2835821       18
## 19  0.2835821       19
## 20  0.2761194       20

Let us plot the K value Vs error Rate :

ggplot(error.df,aes(x=k.values,y=error.rate)) + geom_point()+ geom_line(lty="dotted",color='red')

From the graph we observe that k=11 has the least error. So let us find the accuracy for k=11.

set.seed(101)
predicted.Survived <- knn(titanic.train[,!(colnames(titanic.train) %in% c('Survived'))],titanic.test[,!(colnames(titanic.test) %in% c('Survived'))],titanic.train$Survived,k=11)
misClasificError <- mean(predicted.Survived != titanic.test$Survived)
print(paste('Accuracy',1-misClasificError))
## [1] "Accuracy 0.727611940298507"

The accuracy from the KNN classifier with K = 11 is 72.8 %.

Conclusion:

The summary of the accuracy from the different classifiers we have built:

  1. Logistic Regression - 78.4 %
  2. Decision Trees - 81.3 %
  3. Random Forest - 83.6 %
  4. KNN Classifier - 72.8 %

From this we can conclude that Random Forest is the best classifier in this case.