Let us take the Titanic data set and build a model that predicts survival probability, i.e., whether or not a passenger would have survived the Titanic voyage. This can serve as a prototype for an accident-probability or air-crash-probability predictor and many other such use cases.
The Titanic data set contains various data points about each passenger who was on board the Titanic. The data set also records whether the passenger survived the disaster.
Import Data
df.train <- read.csv('C:\\Users\\uia94128\\Desktop\\MFoi\\Day4\\Titanic\\titanic_train.csv')
head(df.train)
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## Name Sex Age SibSp
## 1 Braund, Mr. Owen Harris male 22 1
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1
## 3 Heikkinen, Miss. Laina female 26 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1
## 5 Allen, Mr. William Henry male 35 0
## 6 Moran, Mr. James male NA 0
## Parch Ticket Fare Cabin Embarked
## 1 0 A/5 21171 7.2500 S
## 2 0 PC 17599 71.2833 C85 C
## 3 0 STON/O2. 3101282 7.9250 S
## 4 0 113803 53.1000 C123 S
## 5 0 373450 8.0500 S
## 6 0 330877 8.4583 Q
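Before moving on, it can help to check the column types and basic summaries. A minimal sketch (assuming df.train has been read in as above):
# Inspect structure: column types and a few sample values per column
str(df.train)
# A quick numeric summary of Age also reports the number of NA values
summary(df.train$Age)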
Exploratory Data Analysis (EDA)
Let us explore how much missing data we have; we can use the Amelia package for this (install it first if it is not already available).
#install.packages("Amelia")
library(Amelia)
## Loading required package: Rcpp
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.5, built: 2018-05-07)
## ## Copyright (C) 2005-2018 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
missmap(df.train, main="Titanic Training Data - Missings Map",
col=c("yellow", "black"), legend=FALSE)
Roughly 20 percent of the Age values are missing. We shall come up with a way to fill in this missing data.
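As a cross-check that does not require Amelia, the missing values can be counted per column directly. A minimal sketch (note that with this read.csv call blank Cabin/Embarked entries are empty strings rather than NA, so only Age is expected to show up):
# Number of NA values in each column
colSums(is.na(df.train))
# Proportion of Age values that are missing
mean(is.na(df.train$Age))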
Data Visualization
Survivors Histogram
library(ggplot2)
ggplot(df.train,aes(Survived)) + geom_bar()
Passenger Count Distribution by Class
ggplot(df.train,aes(Pclass)) + geom_bar(aes(fill=factor(Pclass)),alpha=0.5)
Gender Distribution
ggplot(df.train,aes(Sex)) + geom_bar(aes(fill=factor(Sex)),alpha=0.5)
Data Cleaning
We want to fill in the missing age values instead of simply dropping those rows.
One way to do this is to fill in the mean age of all passengers (imputation).
Another way is to check whether Age is related to some other column and fill in the ages accordingly.
pl <- ggplot(df.train,aes(Pclass,Age)) + geom_boxplot(aes(group=Pclass,fill=factor(Pclass),alpha=0.4))
pl + scale_y_continuous(breaks = seq(min(0), max(80), by = 2))
## Warning: Removed 177 rows containing non-finite values (stat_boxplot).
We shall impute Age based on Pclass using these class-wise average age values, because the boxplot shows that the average age differs by class.
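As a quick sanity check on where these imputation values come from, the class-wise average ages can be computed directly. A sketch (the formula interface of aggregate() drops the NA ages automatically; the exact numbers are not reproduced here):
# Mean and median Age per passenger class
aggregate(Age ~ Pclass, data = df.train, FUN = mean)
aggregate(Age ~ Pclass, data = df.train, FUN = median)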
# Fill missing ages with a class-specific average: 37 for 1st class, 29 for 2nd, 24 for 3rd
impute_age <- function(age,class){
out <- age
for (i in 1:length(age)){
if (is.na(age[i])){
if (class[i] == 1){
out[i] <- 37
}else if (class[i] == 2){
out[i] <- 29
}else{
out[i] <- 24
}
}else{
out[i]<-age[i]
}
}
return(out)
}
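As an aside, the same logic can be written without an explicit loop using nested ifelse() calls. A sketch of an equivalent alternative (impute_age_vec is a hypothetical name; it is not used in the steps below):
# Vectorized equivalent: class-specific fill values for missing ages
impute_age_vec <- function(age, class){
  ifelse(is.na(age),
         ifelse(class == 1, 37, ifelse(class == 2, 29, 24)),
         age)
}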
fixed.ages <- impute_age(df.train$Age,df.train$Pclass)
df.train$Age <- fixed.ages
missmap(df.train, main="Titanic Training Data - Missings Map",
col=c("yellow", "black"), legend=FALSE)
Now there are no missing values in the data set.
Rather than using all the columns to build the model, let us use specific columns as input signals for the model.
df.train <- df.train[,!(colnames(df.train) %in% c('PassengerId','Name','Ticket','Cabin'))]
Train and Test Data Split:
library(caTools)
set.seed(101)
split = sample.split(df.train$Survived, SplitRatio = 0.70)
final.train = subset(df.train, split == TRUE)
final.test = subset(df.train, split == FALSE)
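sample.split() preserves the ratio of the labels in Survived, so both subsets should keep roughly the same survival rate. A quick check (a sketch):
# Survival proportions in the train and test splits should be similar
prop.table(table(final.train$Survived))
prop.table(table(final.test$Survived))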
Let us build the first classifier model using the logistic regression algorithm. The function for a logistic regression model is glm(). Three parameters need to be passed: the model formula, the family (binomial with a logit link), and the training data.
Building the Model:
log.model <- glm(formula=Survived ~ . , family = binomial(link='logit'),data = final.train)
Prediction for test data:
fitted.probabilities <- predict(log.model,newdata=final.test,type='response')
Accuracy Measurement:
fitted.results <- ifelse(fitted.probabilities > 0.5,1,0)
misClasificError <- mean(fitted.results != final.test$Survived)
print(paste('Accuracy',1-misClasificError))
## [1] "Accuracy 0.783582089552239"
The LR classifier model we have built is 78.4% accurate in predicting whether a new passenger would survive the journey on the Titanic.
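Beyond a single accuracy figure, a confusion matrix shows how the errors split between false positives and false negatives. A minimal sketch, reusing fitted.results from above:
# Rows: actual outcome, Columns: predicted outcome
table(Actual = final.test$Survived, Predicted = fitted.results)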
Let us now build the next classifier using the decision tree algorithm. Library: rpart
library(rpart)
tree <- rpart(Survived ~.,method='class',data = final.train)
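It can be instructive to look at the fitted tree itself. A sketch using base R plotting (the rpart.plot package, if installed, gives a nicer rendering):
# Draw the tree structure and label the splits
plot(tree, uniform = TRUE, margin = 0.1)
text(tree, use.n = TRUE, cex = 0.8)
# Alternatively, if rpart.plot is installed:
# library(rpart.plot)
# prp(tree)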
Test the Model: Use predict() to predict the Survived value on the test data.
tree.preds <- predict(tree,final.test)
Check the Head of the predicted values. You should notice that you actually have two columns with the probabilities.
head(tree.preds)
## 0 1
## 4 0.07826087 0.9217391
## 6 0.83547558 0.1645244
## 12 0.07826087 0.9217391
## 21 0.83547558 0.1645244
## 22 0.83547558 0.1645244
## 24 0.83547558 0.1645244
Write a simple R function to collapse these two columns into a single column matching the original "1/0" labels of the Survived column.
tree.preds <- as.data.frame(tree.preds)
joiner <- function(x){
if (x>=0.5){
return('1')
}else{
return("0")
}
}
Apply that function to each row of the prediction result
tree.preds$Survival <- sapply(tree.preds$'1',joiner)
head(tree.preds)
## 0 1 Survival
## 4 0.07826087 0.9217391 1
## 6 0.83547558 0.1645244 0
## 12 0.07826087 0.9217391 1
## 21 0.83547558 0.1645244 0
## 22 0.83547558 0.1645244 0
## 24 0.83547558 0.1645244 0
Calculate the accuracy by comparing the prediction result with the actual values in final.test
misClasificError <- mean(tree.preds$Survival != final.test$Survived)
print(paste('Accuracy',1-misClasificError))
## [1] "Accuracy 0.813432835820896"
The decision tree classifier model we have built is 81.3% accurate in predicting whether a new passenger would survive the journey.
Let us now try to build a random forest classifier. Library: randomForest
library(randomForest)
Build The Model using Train Data:
rf.model <- randomForest(Survived ~ . , data = final.train)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
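The warning appears because Survived is numeric, so randomForest() fits a regression forest and predict() returns averaged scores between 0 and 1. An alternative (a sketch, not used in the steps below) is to convert the response to a factor, which fits a classification forest and makes predict() return class labels directly:
# Classification forest: a factor response avoids the regression warning
rf.class <- randomForest(factor(Survived) ~ . , data = final.train)
rf.class.preds <- predict(rf.class, final.test) # factor with levels "0"/"1"
# Accuracy (may differ slightly from the regression-based approach below)
mean(as.character(rf.class.preds) == as.character(final.test$Survived))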
Test the model using the test data and print the confusion matrix of predictions vs. actual values.
p <- predict(rf.model,final.test)
joiner <- function(x){
if (x>=0.5){
return('1')
}else{
return("0")
}
}
p<- sapply(p,joiner)
table(p,final.test$Survived)
##
## p 0 1
## 0 150 29
## 1 15 74
Calculate the accuracy:
misClasificError <- mean(p != final.test$Survived)
print(paste('Accuracy',1-misClasificError))
## [1] "Accuracy 0.835820895522388"
The random forest classifier model we have built is 83.5% accurate in predicting whether a passenger will survive the voyage.
Let us now build the classifier using the KNN algorithm.
Data needs to be scaled before we pass it to the KNN model.
# Standardize the numeric columns using the scale() R function
standardized.titanic <- scale(df.train[,!(colnames(df.train) %in% c('Sex','Survived','Embarked'))])
standardized.titanic <- data.frame(standardized.titanic)
Add the Survived column to the standardized data frame
standardized.titanic$Survived <- df.train$Survived
head(standardized.titanic)
## Pclass Age SibSp Parch Fare Survived
## 1 0.8269128 -0.5335340 0.4325504 -0.4734077 -0.5021631 0
## 2 -1.5652278 0.6745117 0.4325504 -0.4734077 0.7864036 1
## 3 0.8269128 -0.2315226 -0.4742788 -0.4734077 -0.4885799 1
## 4 -1.5652278 0.4480031 0.4325504 -0.4734077 0.4204941 1
## 5 0.8269128 0.4480031 -0.4742788 -0.4734077 -0.4860644 0
## 6 0.8269128 -0.3825283 -0.4742788 -0.4734077 -0.4778481 0
Train & Test Data Split from the scaled data
library(caTools)
set.seed(101)
split = sample.split(standardized.titanic$Survived, SplitRatio = 0.70)
titanic.train = subset(standardized.titanic, split == TRUE)
titanic.test = subset(standardized.titanic, split == FALSE)
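A note on this design choice: here the whole data set was scaled before splitting, which is simple but lets the test rows influence the centring and scaling parameters. A stricter, hypothetical variant (a sketch, not used below) estimates those parameters on the training rows only and applies them to the test rows:
# Scale using training-split statistics only
num.cols <- !(colnames(df.train) %in% c('Sex','Survived','Embarked'))
train.scaled <- scale(subset(df.train, split == TRUE)[, num.cols])
test.scaled <- scale(subset(df.train, split == FALSE)[, num.cols],
                     center = attr(train.scaled, "scaled:center"),
                     scale = attr(train.scaled, "scaled:scale"))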
Build the Model. Function: knn()
library(class)
predicted.Survived <- knn(titanic.train[,!(colnames(titanic.train) %in% c('Survived'))],titanic.test[,!(colnames(titanic.test) %in% c('Survived'))],titanic.train$Survived,k=1)
head(predicted.Survived)
## [1] 0 0 0 1 0 1
## Levels: 0 1
Check The Accuracy of the Model:
misClasificError <- mean(predicted.Survived != titanic.test$Survived)
print(paste('Accuracy',1-misClasificError))
## [1] "Accuracy 0.645522388059701"
Now Let us try with k=3, considering 3 neighbors to predict the result:
library(class)
predicted.Survived <- knn(titanic.train[,!(colnames(titanic.train) %in% c('Survived'))],titanic.test[,!(colnames(titanic.test) %in% c('Survived'))],titanic.train$Survived,k=3)
misClasificError <- mean(predicted.Survived != titanic.test$Survived)
print(paste('Accuracy',1-misClasificError))
## [1] "Accuracy 0.671641791044776"
Now let us try with k=5, considering 5 neighbors to predict the result:
predicted.Survived <- knn(titanic.train[,!(colnames(titanic.train) %in% c('Survived'))],titanic.test[,!(colnames(titanic.test) %in% c('Survived'))],titanic.train$Survived,k=5)
misClasificError <- mean(predicted.Survived != titanic.test$Survived)
print(paste('Accuracy',1-misClasificError))
## [1] "Accuracy 0.705223880597015"
We observe that the accuracy keeps improving. Let us try to find the optimum value of K.
Code a for Loop to build the model from K=1 to 20 and record the error rate.
predicted.Survived = NULL
error.rate = NULL
for(i in 1:20){
set.seed(101)
predicted.Survived <- knn(titanic.train[,!(colnames(titanic.train) %in% c('Survived'))],titanic.test[,!(colnames(titanic.test) %in% c('Survived'))],titanic.train$Survived,k=i)
error.rate[i] <-mean(predicted.Survived != titanic.test$Survived)
}
print(error.rate)
## [1] 0.3432836 0.3470149 0.3320896 0.3320896 0.2947761 0.3059701 0.2947761
## [8] 0.3059701 0.3022388 0.2985075 0.2723881 0.2798507 0.2910448 0.2798507
## [15] 0.2835821 0.2723881 0.2761194 0.2835821 0.2835821 0.2761194
Elbow Method:
We can plot out the various error rates for the K values. We should see an “elbow” indicating that we don’t get a decrease in error rate for using a higher K. This is a good cut-off point:
To plot some data, we need to have a data frame. Two columns - k value, error rate.
library(ggplot2)
k.values <- 1:20
error.df <- data.frame(error.rate,k.values)
error.df
## error.rate k.values
## 1 0.3432836 1
## 2 0.3470149 2
## 3 0.3320896 3
## 4 0.3320896 4
## 5 0.2947761 5
## 6 0.3059701 6
## 7 0.2947761 7
## 8 0.3059701 8
## 9 0.3022388 9
## 10 0.2985075 10
## 11 0.2723881 11
## 12 0.2798507 12
## 13 0.2910448 13
## 14 0.2798507 14
## 15 0.2835821 15
## 16 0.2723881 16
## 17 0.2761194 17
## 18 0.2835821 18
## 19 0.2835821 19
## 20 0.2761194 20
Let us plot the K value vs. the error rate:
ggplot(error.df,aes(x=k.values,y=error.rate)) + geom_point()+ geom_line(lty="dotted",color='red')
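Rather than reading the minimum off the plot, the best K can also be located programmatically. A sketch using the error.rate vector computed above (in this run it should agree with the reading below):
# K value(s) with the lowest error rate, and that error rate
which(error.rate == min(error.rate))
min(error.rate)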
From the graph we observe that k=11 has the least error. So let us find the accuracy for k=11.
set.seed(101)
predicted.Survived <- knn(titanic.train[,!(colnames(titanic.train) %in% c('Survived'))],titanic.test[,!(colnames(titanic.test) %in% c('Survived'))],titanic.train$Survived,k=11)
misClasificError <- mean(predicted.Survived != titanic.test$Survived)
print(paste('Accuracy',1-misClasificError))
## [1] "Accuracy 0.727611940298507"
The accuracy from the KNN classifier with K = 11 is about 72.8%.
The summary of the accuracies from the different classifiers we have built:
Logistic Regression: 78.4%
Decision Tree: 81.3%
Random Forest: 83.5%
KNN (K = 11): about 72.8%
From this we can conclude that the random forest is the best classifier in this case.