NLP (Natural Language Processing) with Python

Spam Detector Model

Objective:

We will be building a SMS spam detector. The input data we have, to train the model is a file containing sms data and the classification label. Using this we shall build a Naive Bayes classifier model which will detect sms to be ham/spam.

Step 1: Data Import

Import Data from SMSSpamCollection file.

Upload this file in your jupyter notebook directory and call the file directly.

Use pandas library to read data from the csv file.

In [2]:
import pandas as pd
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

msg=pd.read_csv('C:\\Users\\Mohamed Kani\\Desktop\\Data_Science\\NLP\\SMSSpamCollection',sep='\t',names=['label','msgs'])
msg.describe()
Out[2]:
label msgs
count 5572 5572
unique 2 5169
top ham Sorry, I'll call later
freq 4825 30

Step 2: Exploratory Data Analysis

Let us see the summary based on the labels column. group by label column and call describe function.

In [3]:
msg.groupby('label').describe()
Out[3]:
msgs
count unique top freq
label
ham 4825 4516 Sorry, I'll call later 30
spam 747 653 Please call our customer service representativ... 4

Create a new column which will store the length of the each sms.

Step 3: Data Visualization

Let us plot the length distribution of the sms across labels.

In [4]:
msg['length']=msg['msgs'].apply(len)
sns.distplot(msg['length'])
Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0x956f908>

The plot clearly shows that there are sms with very high number of words (>600).

Let us try to find the maximum length of sms. And also check what sms is that.

In [5]:
msg['length'].describe()
Out[5]:
count    5572.000000
mean       80.489950
std        59.942907
min         2.000000
25%        36.000000
50%        62.000000
75%       122.000000
max       910.000000
Name: length, dtype: float64
In [6]:
msg[msg['length']==910]['msgs'].iloc[0]
Out[6]:
"For me the love should start with attraction.i should feel that I need her every time around me.she should be the first thing which comes in my thoughts.I would start the day and end it with her.she should be there every time I dream.love will be then when my every breath has her name.my life should happen around her.my life will be named to her.I would cry for her.will give all my happiness and take all her sorrows.I will be ready to fight with anyone for her.I will be in love when I will be doing the craziest things for her.love will be when I don't have to proove anyone that my girl is the most beautiful lady on the whole planet.I will always be singing praises for her.love will be when I start up making chicken curry and end up makiing sambar.life will be the most beautiful then.will get every morning and thank god for the day because she is with me.I would like to say a lot..will tell later.."
In [7]:
msg.hist(column='length',by ='label')
Out[7]:
array([<matplotlib.axes._subplots.AxesSubplot object at 0x00000000050EA0F0>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x0000000009638390>],
      dtype=object)

Clearly, it is seen that the spam messages are long with more number of words. (peak =160)

The actual messages are short in length with average length around 50.

Step 4: Data Cleaning and Preparation (Text Preprocessing)

The text data need to be cleaned before we use that to train the model. Let us do the below to extract significant words:

  1. Remove puctuations
  2. Remove stopwords
In [20]:
import nltk
nltk.download('stopwords')
import string
from nltk.corpus import stopwords
def text_process(mess):
    """
    1. Remove punc
    2. Remove stopwords
    3. Return clean text words  list
    """
    
    nopunc = [char for char in mess if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    return [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]

msg['msgs'].head(3).apply(text_process)
[nltk_data] Downloading package stopwords to C:\Users\Mohamed
[nltk_data]     Kani\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Out[20]:
0    [Go, jurong, point, crazy, Available, bugis, n...
1                       [Ok, lar, Joking, wif, u, oni]
2    [Free, entry, 2, wkly, comp, win, FA, Cup, fin...
Name: msgs, dtype: object

Step 5: Vectorization

1.Count how many times does a word occur in each message (Known as term frequency)

2.Weigh the counts, so that frequent tokens get lower weight (inverse document frequency)

TF-IDF stands for term frequency-inverse document frequency, and the tf-idf weight is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus.

The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Variations of the tf-idf weighting scheme are often used by search engines as a central tool in ranking a document's relevance given a user query.

In [21]:
from sklearn.feature_extraction.text import CountVectorizer

bow_transformer = CountVectorizer(analyzer=text_process).fit(msg['msgs'])

print(len(bow_transformer.vocabulary_))
11425

Step 6: Train & Test Data Preparation

Now we need to split the data to train & test sets which can be used to build and evaluate the model.

In [23]:
from sklearn.cross_validation import train_test_split
msg_train,msg_test,label_train,label_test = train_test_split(msg['msgs'],msg['label'],test_size=0.3)

Step 7: Build the Model using Train data

In [24]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

msg_bow = bow_transformer.transform(msg_train)
tfidf_transformer = TfidfTransformer().fit(msg_bow)
msg_tfidf=tfidf_transformer.transform(msg_bow)
spam_detect_model = MultinomialNB().fit(msg_tfidf,label_train)

Step 8: Evaluate the model using Test Data

In [25]:
msg_test_bow = bow_transformer.transform(msg_test)
tfidf_test_transformer = TfidfTransformer().fit(msg_test_bow)
msg_test_tfidf=tfidf_transformer.transform(msg_test_bow)
all_predict=spam_detect_model.predict(msg_test_tfidf)

from sklearn.metrics import classification_report
print(classification_report(label_test,all_predict))
             precision    recall  f1-score   support

        ham       0.95      1.00      0.98      1455
       spam       1.00      0.66      0.80       217

avg / total       0.96      0.96      0.95      1672

Thus, the model we built has turned out to be 96% accurate.

Using pipeline feature in sklearn to build the model

In [26]:
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('bow', CountVectorizer(analyzer=text_process)),  # strings to token integer counts
    ('tfidf', TfidfTransformer()),  # integer counts to weighted TF-IDF scores
    ('classifier', MultinomialNB()),  # train on TF-IDF vectors w/ Naive Bayes classifier
])
In [27]:
pipeline.fit(msg_train,label_train)
Out[27]:
Pipeline(memory=None,
     steps=[('bow', CountVectorizer(analyzer=<function text_process at 0x000000000B7A2F28>,
        binary=False, decode_error='strict', dtype=<class 'numpy.int64'>,
        encoding='utf-8', input='content', lowercase=True, max_df=1.0,
        max_features=None, min_df=1, ngram_range=(1, 1), preprocesso...f=False, use_idf=True)), ('classifier', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])
In [28]:
predictions = pipeline.predict(msg_test)
In [29]:
print(classification_report(predictions,label_test))
             precision    recall  f1-score   support

        ham       1.00      0.95      0.98      1525
       spam       0.68      1.00      0.81       147

avg / total       0.97      0.96      0.96      1672

This model thus can classify the new sms being passed in as input into ham and spam categories with 97% accuracy.