
In this assignment, I was given a labeled dataset containing spam and not-spam text messages. My task was to create a model that could reliably distinguish between the two. I stumbled upon this project while finishing my text-mining and analytics course on Coursera.
To keep it short, things didn’t go exactly as planned.
One Google search led to another until I stumbled upon various forums and posts that covered the same topic and reached the same results in a drastically simplified manner.
But I reckon, if I want to get started with text mining and analytics, starting small and utilizing already existing libraries and functions (rather than going through the Maths and creating my own) would be a better idea. That said, the goal of this project was still the same.
GOAL
Employ a Multinomial Naive Bayes classifier to filter between spam and not-spam text messages.
The only difference is the way I achieved this goal: I didn't manually lay out the logic of Naive Bayes; instead, I used the existing implementation from sklearn.
Terminology
NLTK: Natural Language Toolkit; contains a plethora of useful libraries for analyzing and manipulating natural-language text.
Train Test Split: A practice in which part of the available dataset (usually 20%) is kept as a holdout while the model is trained on the remaining 80%. After training, a final test runs the algorithm on the holdout data to see how accurate the predictions are on previously unencountered data (much like a real-world scenario).
Additive Smoothing: A technique used when training Naive Bayes models in which 1 is added to every word token's count, so that words which show up in the testing data but not in the training data don't get assigned a probability of zero (which would zero out the whole prediction).
Stop Words: A stop word is a commonly used word (such as "the", "a", "an", "in") that is usually filtered out before analysis.
Corpus: Body of text, singular. Corpora is the plural of this.
Token: Each “entity” that is a part of whatever was split up based on rules. For example, each word is a token when a sentence is “tokenized” into words. Each sentence can also be a token if you tokenized the sentences out of a paragraph.
Data Sources
Spam.csv – Contains a list of 5,572 text messages, each labeled as spam or not-spam (ham).
Methodology
Part One: Libraries Used
Aside from the usual stuff (like sklearn and pandas), this time the model also required a couple of extra NLTK module downloads in order to work. Fair note: I could've written the code without them, but hey, I like trying out new stuff, and these add-ons do shorten the code to a fair degree.
Punkt [Sentence Tokenizer] – This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviations, collocations, and words that start sentences. (It must be trained on a large body of text before deployment.)
Stopwords [Word Library] – A simple word library in nltk that contains almost all of the widely used stopwords in the English language.
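For reference, here is a minimal sketch of the imports and one-time NLTK downloads this setup implies (an approximation of the actual code rather than a verbatim copy):

import pandas as pd
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# One-time downloads of the NLTK data packages mentioned above
nltk.download('punkt')      # sentence tokenizer
nltk.download('stopwords')  # English stop-word list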
Part Two: Preprocessing
Importing the CSV and creating a pandas DataFrame can be achieved as follows.
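(A sketch: 'spam.csv' is the file name from the Kaggle dataset, and the latin-1 encoding is assumed from the note about encodings in the review section below.)

# Read the raw CSV into a DataFrame; the file isn't UTF-8, hence the explicit encoding
df = pd.read_csv('spam.csv', encoding='latin-1')
print(df.head())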
To start, I removed the unnecessary columns (Unnamed: 2, 3, and 4) and renamed v1 and v2 to label and message respectively. Lastly, I converted the string labels to numeric ones.
i.e.
ham (not spam) –> 0
spam –> 1
This was done because the scikit-learn classifier that I'm going to run works with numeric labels rather than string labels.
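A rough sketch of that cleanup step, assuming the column names the Kaggle CSV produces (v1, v2 and three Unnamed columns):

# Drop the empty columns and give the remaining ones readable names
df = df.drop(columns=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'])
df = df.rename(columns={'v1': 'label', 'v2': 'message'})

# Map the string labels to numbers: ham -> 0, spam -> 1
df['label'] = df['label'].map({'ham': 0, 'spam': 1})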
Of course, the same goes for the text-message corpus that we're trying to make predictions from. While the computer can read the contents of a message, it can't extract any meaning or relation from that on its own.
Not to go too deep, but the way to let the computer find a relation between the words used in a shady scam text message versus those in a normal, legitimate one is to "tokenize" each and every message.
Basically, split the sentences and words from the body of the text and treat them as individual items (tokens).
For example,
Hello I'm Akash I like honey
becomes
[Hello, I'm, Akash, I, like, honey]
NLTK then works by establishing correlations within those tokenized words. You can imagine monetary phishing messages to have lots of common words (tokens) such as ('money', 'lottery', 'winner'). This is a simplified explanation; check out this article for a more in-depth one.
- Remove punctuation, i.e. characters like [!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~], and join the rest back together as ‘nopunc‘.
- Remove stop words: check whether the remaining words in ‘nopunc‘ are stop words; drop those, and add the rest to a list called ‘clean_message‘.
–> Return clean_message.
As hinted previously, the addition of the punkt and stopwords modules really did help in making the preprocessing shorter and more readable.
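Putting the steps above together, the preprocessing function might look roughly like this (text_process is my name for it; nopunc and clean_message follow the step descriptions):

import string
from nltk.corpus import stopwords

def text_process(message):
    # Remove punctuation characters and join the remaining characters back together
    nopunc = ''.join(ch for ch in message if ch not in string.punctuation)
    # Drop English stop words and keep the rest as clean tokens
    clean_message = [word for word in nopunc.split()
                     if word.lower() not in stopwords.words('english')]
    return clean_message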
Just to check how our tokenization works before we proceed.
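Something as simple as applying the function to the first few rows does the trick (sketch, continuing from the snippets above):

# Sanity check: preprocess the first few messages and eyeball the tokens
print(df['message'].head().apply(text_process))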
Part Three: Train Test Split
The label column was extracted from the dataframe separately, and a features series was created that contained the tokenized (processed) version of the messages.
The role of CountVectorizer is to convert the text tokens into a matrix of token counts. The rest of the train/test split was more or less standard (80:20).
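One way to wire this up (a sketch; the exact plumbing may differ from the original code, and random_state is my own arbitrary choice):

# Features and labels
X = df['message']
y = df['label']

# Convert each message into a vector of token counts, using the
# preprocessing function above as the analyzer
vectorizer = CountVectorizer(analyzer=text_process)
X_counts = vectorizer.fit_transform(X)

# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X_counts, y, test_size=0.2, random_state=42)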
Part Four: Testing Algorithms
I used only one algorithm for this project, and that was MultinomialNB (Multinomial Naive Bayes).
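Fitting it is essentially a one-liner with sklearn (sketch, continuing from the split above):

# Train the Multinomial Naive Bayes classifier on the training split.
# alpha defaults to 1.0, i.e. additive (Laplace) smoothing is built in.
model = MultinomialNB()
model.fit(X_train, y_train)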
Part Five: Accuracy Results
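A sketch of how the train/test accuracies (and the confusion matrix mentioned in the review) could be computed:

from sklearn.metrics import accuracy_score, confusion_matrix

# Compare predictions on the training data vs. the unseen holdout data
print('On Train Dataset:', accuracy_score(y_train, model.predict(X_train)))
print('On Test Dataset: ', accuracy_score(y_test, model.predict(X_test)))
print(confusion_matrix(y_test, model.predict(X_test)))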
To summarize,
MultinomialNB
- On Train Dataset: 0.9946152120260264
- On Test Dataset: 0.957847533632287
As always, the full code can be found on my GitHub repository.
REVIEW
And that brings us to the end of another project. With a test accuracy of about 96%, I'd say it was a success. Of course, more could be done to increase that metric. Although I wasn't able to personally verify it, seeing how the Naive Bayes classifier worked out makes me believe that either:
- The model comes with additive smoothing built in, or
- It doesn't require additive smoothing.
(For the record, sklearn's MultinomialNB does apply additive smoothing by default, via its alpha parameter, which defaults to 1.0.)
Although it didn’t occur to me initially, latin-1 and ISO 8859-1 are essentially the same, – further. Both UNICODE and ISO-8859 encode ASCII the same way.
The use of punkt and stopwords modules helped in condensing the code (and also made it much more readable).
Tokenization and CountVectorizer were the two things I learned through this project. The first creates a tokenized array from raw text, and the latter converts those tokens into a matrix of token counts. At least, that's what they helped me achieve in this project.
I like the idea of including confusion matrices in my accuracy reports; rather than just an arbitrary percentage, they actually help in understanding how exactly my algorithms reached that figure.
In the future, I would like to work with TF-IDF, word stemming, n-grams, word clouds, and the like to improve my analysis, because let's be honest, this one wasn't that informative. I still don't know, for example, what separates a spam text message from a genuine one.
Resources
https://towardsdatascience.com/spam-classifier-in-python-from-scratch-27a98ddd8e73
https://randerson112358.medium.com/email-spam-detection-using-python-machine-learning-abe38c889855
https://www.kaggle.com/uciml/sms-spam-collection-dataset

About Me!
An aspiring data scientist with a great interest in machine learning and its applications. I post my work here in the hope of improving over time.