Objective:
We will build an SMS spam detector. The training data is a file containing SMS messages along with their classification labels. Using it, we will build a Naive Bayes classifier that labels each SMS as ham or spam.
Import the data from the SMSSpamCollection file.
Upload the file to your Jupyter notebook directory so it can be referenced directly.
Use the pandas library to read the tab-separated file.
import pandas as pd
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
msg = pd.read_csv('SMSSpamCollection', sep='\t', names=['label', 'msgs'])
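Peek at the first few rows to confirm the file loaded as expected:
msg.head()
Then get a statistical summary of the data: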
msg.describe()
Let us see a summary per label: group by the label column and call the describe function.
msg.groupby('label').describe()
Create a new column that stores the length of each SMS.
Let us plot the length distribution of the messages across labels.
msg['length'] = msg['msgs'].apply(len)
sns.histplot(msg['length'], kde=True)  # sns.distplot is deprecated in recent seaborn releases
The plot clearly shows that there are SMS messages with a very high character count (over 600).
Let us find the maximum SMS length, and also check which message that is.
msg['length'].describe()
msg[msg['length'] == msg['length'].max()]['msgs'].iloc[0]  # the 910-character outlier
msg.hist(column='length', by='label')
Clearly, the spam messages are longer, peaking around 160 characters.
The ham messages are shorter, with an average length around 50 characters.
The text data needs to be cleaned before we use it to train the model. Let us do the following to extract the significant words:
import nltk
nltk.download('stopwords')
import string
from nltk.corpus import stopwords
def text_process(mess):
    """
    1. Remove punctuation
    2. Remove stopwords
    3. Return a list of the remaining clean words
    """
    # Strip punctuation characters, then rejoin into a single string
    nopunc = ''.join([char for char in mess if char not in string.punctuation])
    # Keep only the words that are not English stopwords
    return [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]
msg['msgs'].head(3).apply(text_process)
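As a quick sanity check, we can run text_process on a made-up message (hypothetical, not from the dataset):
text_process('Winner!! You have won a free prize, claim it now.')
# expected (with NLTK's English stopword list): ['Winner', 'won', 'free', 'prize', 'claim']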
To train a classifier, each message must be converted into a numeric vector. We do this in two steps:
1. Count how many times each word occurs in each message (known as term frequency).
2. Weigh the counts so that frequent tokens get lower weight (inverse document frequency).
TF-IDF stands for term frequency-inverse document frequency, a weighting scheme often used in information retrieval and text mining. It is a statistical measure of how important a word is to a document in a collection or corpus.
The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Variations of the tf-idf weighting scheme are often used by search engines as a central tool in ranking a document's relevance given a user query.
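To make the weighting concrete, here is a small sketch on a made-up mini-corpus (toy_corpus and tfidf_demo are illustrative names, separate from the SMS data):
from sklearn.feature_extraction.text import TfidfVectorizer
toy_corpus = ['free prize now', 'call me now', 'free call']
tfidf_demo = TfidfVectorizer().fit(toy_corpus)
print(sorted(tfidf_demo.vocabulary_))                       # ['call', 'free', 'me', 'now', 'prize']
print(tfidf_demo.transform(toy_corpus).toarray().round(2))
# Words that occur in two documents ('call', 'free', 'now') get a lower IDF
# weight than words that occur in only one ('me', 'prize').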
from sklearn.feature_extraction.text import CountVectorizer
# Fit a bag-of-words transformer on the messages, using text_process as the tokenizer
# (for simplicity, the vocabulary is learned from the full dataset before the train/test split)
bow_transformer = CountVectorizer(analyzer=text_process).fit(msg['msgs'])
print(len(bow_transformer.vocabulary_))  # size of the learned vocabulary
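To see what this transformer produces, we can encode a single made-up message (sample_bow is an illustrative name) and inspect the sparse output:
sample_bow = bow_transformer.transform(['Free prize, call now!'])
print(sample_bow.shape)   # (1, vocabulary size)
print(sample_bow.nnz)     # how many vocabulary words appear in this message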
Now we need to split the data into train and test sets, which we will use to build and evaluate the model.
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer scikit-learn releases
msg_train, msg_test, label_train, label_test = train_test_split(msg['msgs'], msg['label'], test_size=0.3)
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
msg_bow = bow_transformer.transform(msg_train)       # bag-of-words counts for the training set
tfidf_transformer = TfidfTransformer().fit(msg_bow)  # learn IDF weights from the training counts
msg_tfidf = tfidf_transformer.transform(msg_bow)     # weighted TF-IDF vectors
spam_detect_model = MultinomialNB().fit(msg_tfidf, label_train)
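As an optional sanity check (not part of the original flow), we can score the model on its own training data; expect this figure to be optimistic relative to the test set:
print('Training accuracy:', spam_detect_model.score(msg_tfidf, label_train))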
msg_test_bow = bow_transformer.transform(msg_test)
# Reuse the TF-IDF transformer fitted on the training data; do not refit it on the test set
msg_test_tfidf = tfidf_transformer.transform(msg_test_bow)
all_predict = spam_detect_model.predict(msg_test_tfidf)
from sklearn.metrics import classification_report
print(classification_report(label_test,all_predict))
Thus, the model we built turns out to be about 96% accurate on the test set.
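To see where the errors fall, we can also print a confusion matrix (a small optional sketch using scikit-learn's confusion_matrix):
from sklearn.metrics import confusion_matrix
print(confusion_matrix(label_test, all_predict))   # rows = true labels, columns = predicted labels
Finally, we can wrap the bag-of-words, TF-IDF, and classifier steps into a single scikit-learn Pipeline and repeat the workflow: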
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
('bow', CountVectorizer(analyzer=text_process)), # strings to token integer counts
('tfidf', TfidfTransformer()), # integer counts to weighted TF-IDF scores
('classifier', MultinomialNB()), # train on TF-IDF vectors w/ Naive Bayes classifier
])
pipeline.fit(msg_train, label_train)
predictions = pipeline.predict(msg_test)
print(classification_report(label_test, predictions))  # true labels first, then predictions
This pipeline can thus classify any new SMS passed in as input into the ham or spam category, with about 97% accuracy.
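To classify a brand-new message, simply pass it to the fitted pipeline. The message below is a made-up example (new_sms is an illustrative name):
new_sms = ['Congratulations! You have won a free prize, call now to claim.']
print(pipeline.predict(new_sms))   # a promotional message like this will likely be labelled 'spam'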