Digital Marketing - Click Predictor Model

Problem Statement

In this use case we work with a synthetic advertising data set that indicates whether or not a particular internet user clicked on an advertisement on a company website.

With this data, we will try to build a model that predicts whether or not a user will click on an ad, based on the features of that user.

Data Fields

This data set contains the following features:

  • 'Daily Time Spent on Site': consumer time on site in minutes
  • 'Age': customer age in years
  • 'Area Income': Avg. Income of geographical area of consumer
  • 'Daily Internet Usage': Avg. minutes a day consumer is on the internet
  • 'Ad Topic Line': Headline of the advertisement
  • 'City': City of consumer
  • 'Male': Whether or not consumer was male
  • 'Country': Country of consumer
  • 'Timestamp': Time at which consumer clicked on Ad or closed window
  • 'Clicked on Ad': 0 or 1, indicating whether the consumer clicked on the Ad

Import Libraries

Import a few libraries you think you'll need (Or just import them as you go along!)

In [36]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

Get the Data

Read in the advertising.csv file and set it to a data frame called ad_data.

In [37]:
ad_data = pd.read_csv('advertising.csv')

Check the head of ad_data

In [38]:
ad_data.head()
Out[38]:
   Daily Time Spent on Site  Age  Area Income  Daily Internet Usage  Ad Topic Line                          City            Male  Country     Timestamp            Clicked on Ad
0  68.95                     35   61833.90     256.09                Cloned 5thgeneration orchestration     Wrightburgh     0     Tunisia     2016-03-27 00:53:11  0
1  80.23                     31   68441.85     193.77                Monitored national standardization     West Jodi       1     Nauru       2016-04-04 01:39:02  0
2  69.47                     26   59785.94     236.50                Organic bottom-line service-desk       Davidton        0     San Marino  2016-03-13 20:35:42  0
3  74.15                     29   54806.18     245.89                Triple-buffered reciprocal time-frame  West Terrifurt  1     Italy       2016-01-10 02:31:19  0
4  68.37                     35   73889.99     225.58                Robust logistical utilization          South Manuel    0     Iceland     2016-06-03 03:36:18  0

Use info() and describe() on ad_data

In [39]:
ad_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
Daily Time Spent on Site    1000 non-null float64
Age                         1000 non-null int64
Area Income                 1000 non-null float64
Daily Internet Usage        1000 non-null float64
Ad Topic Line               1000 non-null object
City                        1000 non-null object
Male                        1000 non-null int64
Country                     1000 non-null object
Timestamp                   1000 non-null object
Clicked on Ad               1000 non-null int64
dtypes: float64(3), int64(3), object(4)
memory usage: 78.2+ KB
In [40]:
ad_data.describe()
Out[40]:
       Daily Time Spent on Site  Age          Area Income   Daily Internet Usage  Male         Clicked on Ad
count  1000.000000               1000.000000  1000.000000   1000.000000           1000.000000  1000.00000
mean   65.000200                 36.009000    55000.000080  180.000100            0.481000     0.50000
std    15.853615                 8.785562     13414.634022  43.902339             0.499889     0.50025
min    32.600000                 19.000000    13996.500000  104.780000            0.000000     0.00000
25%    51.360000                 29.000000    47031.802500  138.830000            0.000000     0.00000
50%    68.215000                 35.000000    57012.300000  183.130000            0.000000     0.50000
75%    78.547500                 42.000000    65470.635000  218.792500            1.000000     1.00000
max    91.430000                 61.000000    79484.800000  269.960000            1.000000     1.00000
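
The info() output above lists Timestamp as an object (string) column, so it cannot be used as a numeric feature in that form. As a minimal sketch, the lines below parse three Timestamp values copied from the head() output and derive the hour and month. The frame name sample and the column names 'Hour' and 'Month' are assumptions made for illustration; the models in this notebook do not use them.

```python
import pandas as pd

# Three Timestamp strings copied from the head() output of ad_data.
sample = pd.DataFrame({'Timestamp': ['2016-03-27 00:53:11',
                                     '2016-04-04 01:39:02',
                                     '2016-03-13 20:35:42']})

# Convert the strings to datetime64 so the .dt accessor becomes available.
sample['Timestamp'] = pd.to_datetime(sample['Timestamp'])

# Hypothetical derived columns (not part of the original data set).
sample['Hour'] = sample['Timestamp'].dt.hour
sample['Month'] = sample['Timestamp'].dt.month

hours = sample['Hour'].tolist()
print(sample)
```

Whether such time features help the click prediction would have to be checked by adding them to X and re-running the models.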

Exploratory Data Analysis

Let's use seaborn to explore the data!

Try recreating the plots shown below!

Create a histogram of the Age

In [41]:
sns.set_style('whitegrid')
ad_data['Age'].hist(bins=30)
plt.xlabel('Age')
Out[41]:
Text(0.5,0,'Age')

Create a jointplot showing Area Income versus Age.

In [42]:
sns.jointplot(x='Age',y='Area Income',data=ad_data)
Out[42]:
<seaborn.axisgrid.JointGrid at 0xeb4fcc0>

Create a jointplot of 'Daily Time Spent on Site' vs. 'Daily Internet Usage'

In [43]:
sns.jointplot(x='Daily Time Spent on Site',y='Daily Internet Usage',data=ad_data,color='green')
Out[43]:
<seaborn.axisgrid.JointGrid at 0xec4c9b0>

Finally, create a pairplot with the hue defined by the 'Clicked on Ad' column feature.

In [44]:
sns.pairplot(ad_data,hue='Clicked on Ad',palette='bwr')
Out[44]:
<seaborn.axisgrid.PairGrid at 0xeb30780>

Train & Test Data Split

Split the data into training set and testing set using train_test_split

In [45]:
from sklearn.model_selection import train_test_split
In [46]:
X = ad_data[['Daily Time Spent on Site', 'Age', 'Area Income','Daily Internet Usage', 'Male']]
y = ad_data['Clicked on Ad']
In [47]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

LOGISTIC REGRESSION

Train and fit a logistic regression model on the training set.

In [48]:
from sklearn.linear_model import LogisticRegression
In [49]:
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)
Out[49]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Predictions and Evaluations

Now predict values for the testing data.

In [50]:
predictions = logmodel.predict(X_test)

Create a classification report for the model.

In [51]:
from sklearn.metrics import classification_report
In [52]:
print(classification_report(y_test,predictions))
             precision    recall  f1-score   support

          0       0.87      0.96      0.91       162
          1       0.96      0.86      0.91       168

avg / total       0.91      0.91      0.91       330

DECISION TREES

In [53]:
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier()
dtree.fit(X_train,y_train)
Out[53]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
In [54]:
predictions = dtree.predict(X_test)
from sklearn.metrics import classification_report,confusion_matrix
print(classification_report(y_test,predictions))
             precision    recall  f1-score   support

          0       0.94      0.94      0.94       162
          1       0.94      0.95      0.94       168

avg / total       0.94      0.94      0.94       330

In [55]:
print(confusion_matrix(y_test,predictions))
[[152  10]
 [  9 159]]
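
The figures in the classification report can be recovered from the confusion matrix by hand. The sketch below redoes that arithmetic for the decision tree matrix printed above, relying on the scikit-learn convention that rows are actual classes and columns are predicted classes. The variable names are chosen here for illustration.

```python
import numpy as np

# Decision tree confusion matrix from the output above
# (rows: actual class 0/1, columns: predicted class 0/1).
cm = np.array([[152, 10],
               [9, 159]])

# For a 2x2 matrix, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # (159 + 152) / 330
precision_1 = tp / (tp + fp)      # 159 / 169
recall_1 = tp / (tp + fn)         # 159 / 168

print(round(accuracy, 2), round(precision_1, 2), round(recall_1, 2))
```

Rounded to two decimals, these values agree with the class 1 row of the decision tree report (precision 0.94, recall 0.95) and with the overall 0.94.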

RANDOM FOREST

Now let's build a random forest model

In [56]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train, y_train)
Out[56]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
In [57]:
rfc_pred = rfc.predict(X_test)
print(confusion_matrix(y_test,rfc_pred))
[[156   6]
 [ 10 158]]
In [58]:
print(classification_report(y_test,rfc_pred))
             precision    recall  f1-score   support

          0       0.94      0.96      0.95       162
          1       0.96      0.94      0.95       168

avg / total       0.95      0.95      0.95       330

KNN CLASSIFIER MODEL

Let us build a KNN classifier to predict the click. KNN relies on distances between observations, so we first standardize the features to a common scale.

In [59]:
from sklearn.preprocessing import StandardScaler
#Standardize  the data to a common scale
scaler= StandardScaler()
scaler.fit(ad_data[['Daily Time Spent on Site', 'Age', 'Area Income','Daily Internet Usage', 'Male']])
scaled_features= scaler.transform(ad_data[['Daily Time Spent on Site', 'Age', 'Area Income','Daily Internet Usage', 'Male']])
df_feat=pd.DataFrame(scaled_features,columns=['Daily Time Spent on Site', 'Age', 'Area Income','Daily Internet Usage', 'Male'])

Train and Test Data Split on the standardized Data

In [60]:
# sklearn.cross_validation was removed in newer scikit-learn releases; use model_selection
from sklearn.model_selection import train_test_split
X= df_feat
y= ad_data['Clicked on Ad']
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=101)
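
In the cells above the scaler is fitted on the full data set before the split, so the test rows contribute to the scaling statistics. One common alternative is to split first and let a Pipeline fit the scaler on the training rows only. The following is a minimal sketch of that idea, not the notebook's method: it uses synthetic data from make_classification as a stand-in for the five numeric features, and n_neighbors=5 is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the five numeric features (assumption for illustration).
X_demo, y_demo = make_classification(n_samples=1000, n_features=5,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

# Split first, so the test rows never influence the scaler.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=101)

# fit() learns the mean and standard deviation from X_tr only;
# score() applies the same transformation to X_te before predicting.
knn_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_pipe.fit(X_tr, y_tr)

test_accuracy = knn_pipe.score(X_te, y_te)
print(test_accuracy)
```

With the real data, X_demo and y_demo would be replaced by the unscaled feature columns of ad_data and the 'Clicked on Ad' column.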

Elbow Method To Determine Number of Neighbors

In [61]:
from sklearn.neighbors import KNeighborsClassifier
#Elbow method to determine the optimal K value
error_rate=[]
for i in range(1,20):
    knn=KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train,y_train)
    pred_i = knn.predict(X_test)
    # Fraction of test rows misclassified for this K
    error_rate.append(np.mean(pred_i != y_test))

plt.plot(range(1,20),error_rate)
Out[61]:
[<matplotlib.lines.Line2D at 0x10dd8c50>]
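
The elbow loop above measures the error on the test set, which means the test set also takes part in choosing K. A variation, sketched here under stated assumptions, estimates the error for each K with 5-fold cross-validation on the training data and leaves the test set untouched. The data is again a synthetic stand-in, and the names cv_error and best_k are introduced only for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled training data (assumption for illustration).
X_demo, y_demo = make_classification(n_samples=700, n_features=5,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

k_values = range(1, 20)
cv_error = []
for k in k_values:
    # Mean accuracy over 5 folds; the error rate is its complement.
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X_demo, y_demo, cv=5)
    cv_error.append(1 - scores.mean())

# K with the lowest cross-validated error (first one in case of ties).
best_k = k_values[cv_error.index(min(cv_error))]
print(best_k)
```

The cv_error list can be plotted against k_values in the same way as error_rate to look for the elbow.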
In [62]:
#Build the Model with 12 as K
knn=KNeighborsClassifier(n_neighbors=12)
knn.fit(X_train,y_train)

# Predict and Evaluate the Model
pred = knn.predict(X_test)
print(confusion_matrix(y_test,pred))
print(classification_report(y_test,pred))
[[157   0]
 [  8 135]]
             precision    recall  f1-score   support

          0       0.95      1.00      0.98       157
          1       1.00      0.94      0.97       143

avg / total       0.97      0.97      0.97       300

Conclusion

The accuracy levels of the models are observed to be:

  1. Decision Tree Classifier - 94%
  2. Random Forest Classifier - 95%
  3. Logistic Regression Classifier - 91%
  4. KNN Classifier - 97%

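Out of the four, the KNN Classifier gives the highest accuracy at 97%. For the Click Predictor Model we are building, we therefore suggest the K Nearest Neighbors algorithm. Note that KNN was evaluated on a different split (test_size=0.3, random_state=101) from the other three models (test_size=0.33, random_state=42), so the figures are not strictly comparable.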
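
One way to put the four models on an equal footing is to score all of them on the same cross-validation folds. The sketch below shows the idea on synthetic data, so its numbers say nothing about the advertising data set. The models dictionary, the fold settings, and the hyperparameters are assumptions for illustration, apart from n_estimators=100 and n_neighbors=12, which are taken from the cells above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the five numeric features (assumption for illustration).
X_demo, y_demo = make_classification(n_samples=1000, n_features=5,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

models = {
    'Logistic Regression': make_pipeline(StandardScaler(), LogisticRegression()),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=0),
    'KNN': make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=12)),
}

# The same shuffled, stratified folds are reused for every model.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

mean_accuracy = {name: cross_val_score(model, X_demo, y_demo, cv=cv).mean()
                 for name, model in models.items()}

for name, acc in mean_accuracy.items():
    print(f'{name}: {acc:.3f}')
```

Running the same loop with the features and target from ad_data would give accuracy figures that can be compared directly.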