In this use case we will be working with a fake advertising data set that indicates whether or not a particular internet user clicked on an advertisement on a company website.
With this data, we will try to create a model that predicts whether or not a user will click on an ad, based on the features of that user.
This data set contains the following features:
Import a few libraries you think you'll need (Or just import them as you go along!)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
Read in the advertising.csv file and set it to a data frame called ad_data.
ad_data = pd.read_csv('advertising.csv')
Check the head of ad_data
ad_data.head()
Use info() and describe() on ad_data
ad_data.info()
ad_data.describe()
Let's use seaborn to explore the data!
Try recreating the plots shown below!
Create a histogram of the Age
sns.set_style('whitegrid')
ad_data['Age'].hist(bins=30)
plt.xlabel('Age')
Create a jointplot showing Area Income versus Age.
sns.jointplot(x='Age',y='Area Income',data=ad_data)
Create a jointplot of 'Daily Time Spent on Site' vs. 'Daily Internet Usage'
sns.jointplot(x='Daily Time Spent on Site',y='Daily Internet Usage',data=ad_data,color='green')
Finally, create a pairplot with the hue defined by the 'Clicked on Ad' column feature.
sns.pairplot(ad_data,hue='Clicked on Ad',palette='bwr')
Split the data into training set and testing set using train_test_split
from sklearn.model_selection import train_test_split
X = ad_data[['Daily Time Spent on Site', 'Age', 'Area Income','Daily Internet Usage', 'Male']]
y = ad_data['Clicked on Ad']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
Train and fit a logistic regression model on the training set.
from sklearn.linear_model import LogisticRegression
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)
Now predict values for the testing data.
predictions = logmodel.predict(X_test)
Create a classification report for the model.
from sklearn.metrics import classification_report
print(classification_report(y_test,predictions))
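To make the numbers in a classification report easier to read, here is a minimal sketch of how precision, recall, F1 and accuracy for the positive class follow from the four confusion counts. The labels in `y_true_demo` and `y_pred_demo` are made up for illustration and are not taken from advertising.csv.

```python
import numpy as np

# Hypothetical labels for illustration only (not from advertising.csv)
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_pred_demo = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])

# Count the four outcomes, treating 1 ("clicked") as the positive class
tp = int(np.sum((y_true_demo == 1) & (y_pred_demo == 1)))  # true positives
fp = int(np.sum((y_true_demo == 0) & (y_pred_demo == 1)))  # false positives
fn = int(np.sum((y_true_demo == 1) & (y_pred_demo == 0)))  # false negatives
tn = int(np.sum((y_true_demo == 0) & (y_pred_demo == 0)))  # true negatives

precision = tp / (tp + fp)              # of predicted clicks, how many were real
recall = tp / (tp + fn)                 # of real clicks, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / len(y_true_demo)

print(tp, fp, fn, tn)
print(precision, recall, f1, accuracy)
```

The report printed by classification_report gives the same kind of figures for each class, plus the support (the number of true rows in that class).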
Now let's build a decision tree model
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier()
dtree.fit(X_train,y_train)
predictions = dtree.predict(X_test)
from sklearn.metrics import classification_report,confusion_matrix
print(classification_report(y_test,predictions))
print(confusion_matrix(y_test,predictions))
Now let's build a random forest model
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train, y_train)
rfc_pred = rfc.predict(X_test)
print(confusion_matrix(y_test,rfc_pred))
print(classification_report(y_test,rfc_pred))
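A fitted random forest can also report which features it relied on. The sketch below uses a synthetic stand-in for the data: the column names match the notebook, but the values and the labelling rule (a click whenever daily time on site is under 65) are assumptions made purely so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-in data; values are invented, not read from advertising.csv
demo_X = pd.DataFrame({
    'Daily Time Spent on Site': rng.normal(65, 15, n),
    'Age': rng.integers(19, 62, n),
    'Area Income': rng.normal(55000, 13000, n),
    'Daily Internet Usage': rng.normal(180, 40, n),
    'Male': rng.integers(0, 2, n),
})
# Assumed labelling rule for the demo: only time on site matters
demo_y = (demo_X['Daily Time Spent on Site'] < 65).astype(int)

demo_rfc = RandomForestClassifier(n_estimators=100, random_state=0)
demo_rfc.fit(demo_X, demo_y)

# Impurity-based importances, one per column, summing to 1
importances = pd.Series(
    demo_rfc.feature_importances_, index=demo_X.columns
).sort_values(ascending=False)
print(importances)
```

On the real data, the same two lines after fitting (`pd.Series(rfc.feature_importances_, index=X.columns)`) would show how rfc weighs each of the five features.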
Let us build a KNN classifier to predict the click. KNN relies on distances between points, so first standardize the features to a common scale.
from sklearn.preprocessing import StandardScaler
#Standardize the data to a common scale
scaler= StandardScaler()
scaler.fit(ad_data[['Daily Time Spent on Site', 'Age', 'Area Income','Daily Internet Usage', 'Male']])
scaled_features= scaler.transform(ad_data[['Daily Time Spent on Site', 'Age', 'Area Income','Daily Internet Usage', 'Male']])
df_feat=pd.DataFrame(scaled_features,columns=['Daily Time Spent on Site', 'Age', 'Area Income','Daily Internet Usage', 'Male'])
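The cell above fits the scaler on every row before the data is split, so the test rows influence the scaling. One possible alternative, sketched here on synthetic data, is to put the scaler and the classifier in a scikit-learn pipeline so the scaler is fitted on the training rows only. The arrays `X_demo` and `y_demo` and the choice of 5 neighbors are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 300

# Synthetic stand-in: two features on very different scales.
# Only the first (small-scale) feature determines the label.
X_demo = np.column_stack([rng.normal(0, 1, n), rng.normal(50000, 10000, n)])
y_demo = (X_demo[:, 0] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101
)

# Scaler + KNN in one estimator: fit() scales using training rows only
knn_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_pipe.fit(X_tr, y_tr)
pipe_acc = knn_pipe.score(X_te, y_te)

# Same classifier without scaling, for comparison
raw_knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
raw_acc = raw_knn.score(X_te, y_te)

print('scaled:', pipe_acc, 'unscaled:', raw_acc)
```

Without scaling, the large-valued column dominates the distance calculation, which is why the unscaled model does worse in this demo.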
Train and Test Data Split on the standardized Data
from sklearn.model_selection import train_test_split
X= df_feat
y= ad_data['Clicked on Ad']
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=101)
Elbow Method To Determine Number of Neighbors
from sklearn.neighbors import KNeighborsClassifier
#Elbow method to determine the optimal K value
error_rate=[]
for i in range(1,20):
    knn=KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train,y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))
plt.plot(range(1,20),error_rate)
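The loop above picks K by looking at the error on the test set. A variation you could try is to estimate the error with cross-validation instead, so the test set stays untouched until the final evaluation. This is a sketch on synthetic data from make_classification, not the notebook's own method; `X_demo`, `y_demo` and the 5-fold setting are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the (training) features and labels
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

k_values = list(range(1, 20))
cv_error = []
for k in k_values:
    # Scaling sits inside the pipeline so each fold is scaled on its own training part
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy')
    cv_error.append(1 - scores.mean())  # mean error rate across the 5 folds

# K with the lowest cross-validated error (first one if there is a tie)
best_k = k_values[int(np.argmin(cv_error))]
print('best K:', best_k)
```

Plotting `cv_error` against `k_values` gives the same kind of elbow curve as before.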
#Build the Model with 12 as K
knn=KNeighborsClassifier(n_neighbors=12)
knn.fit(X_train,y_train)
# Predict and Evaluate the Model
pred = knn.predict(X_test)
print(confusion_matrix(y_test,pred))
print(classification_report(y_test,pred))
The accuracy levels of the models are observed to be:
Of the four models, the KNN classifier reaches 97% accuracy. For the click predictor we are building, we therefore suggest the K Nearest Neighbors algorithm, which gives the best accuracy.
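In this notebook the KNN model was evaluated on a different split (test_size=0.3, random_state=101) from the other three models (test_size=0.33, random_state=42), so the accuracies are not measured on the same rows. A minimal sketch of a like-for-like comparison is given here. It runs on synthetic data from make_classification, so the printed numbers say nothing about advertising.csv, and the model settings (max_iter, random_state, scaling inside a pipeline) are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the five features and the click label
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=1
)
# One split shared by all four models
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

models = {
    'Logistic Regression': make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=0),
    'KNN (k=12)': make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=12)),
}

accuracy = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    accuracy[name] = accuracy_score(y_te, model.predict(X_te))

# Print the models from most to least accurate on the shared test set
for name, acc in sorted(accuracy.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{name:<20} {acc:.3f}')
```

Swapping `X_demo` and `y_demo` for the notebook's X and y would produce the accuracy table for the real data.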