#4 Binary Classification: Prediction On Breast Cancer Datasets

This time around, I decided to go on a solo machine learning adventure! That is, I chose the data and the goal myself from the myriad machine learning challenges available on the web.

Since this was my first time, I went for one of the most classic applications first.

No, not letter recognition (although I do plan on doing that in the future). I’m talking about breast cancer prediction, a.k.a. the bread and butter of machine learning applications. It has great practical implications (early diagnosis of breast cancer has been shown to improve the prognosis and chance of survival significantly), on top of being beginner-friendly. My goal for this solo project was:

Train a machine learning model that is adept at predicting the likelihood of breast cancer


Terminology Used

Benign – (In medical context) A tumor that is not cancerous: it does not invade nearby tissue or spread to other parts of the body. Here, a benign diagnosis means that the sampled breast mass is not cancer.

Malignant – (In medical context) A tumor that is cancerous: its cells can invade nearby tissue and spread to other parts of the body. Malignant does not mean infectious. Here, a malignant diagnosis means that the sampled breast mass is cancer.

Cytological – Relating to cytology, the medical and scientific study of cells. Cytology is a branch of pathology, the medical specialty that deals with making diagnoses of diseases and conditions through the examination of cell and tissue samples from the body.

Mean – (In statistical context) The mean (average) of a data set is found by adding all values in the dataset and then dividing by the number of values in the set.

Standard error – (In statistical context) The standard deviation of a statistic's sampling distribution. For a mean, it is the sample standard deviation divided by the square root of the sample size, and it indicates how precisely the sample mean estimates the population mean.

Extreme Value – (In statistical context) The smallest or largest value in a set of observations. In this dataset it is the "worst" value of a feature, computed as the mean of its three largest values in the image.

Vector – (In mathematical context) a matrix with one row or one column.

Data Sources Used

Breast Cancer Wisconsin (Diagnostic) Database, created by Dr. William H. Wolberg, a physician at the University of Wisconsin Hospital, USA. Dr. Wolberg built this dataset from fluid samples taken from patients with solid breast masses, using the graphical computer program Xcyt.

The program was capable of analyzing cytological features based on digital scans. Using a curve-fitting algorithm, Xcyt computed ten features for each of the cells in the sample, then calculated the mean value, extreme value, and standard error of each feature for the image, returning a 30-dimensional real-valued vector. The final database contains 569 rows and 32 columns (an ID, the diagnosis, and the 30 features).
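To make the "ten features times three statistics" idea concrete, here is a minimal sketch with made-up numbers. The per-cell measurements are hypothetical (two features instead of ten, five cells per image), and the standard error formula is the usual one for a mean, which is an assumption about how Xcyt computed it. The "worst" value follows the dataset's description: the mean of the three largest values.

```python
import numpy as np

# Hypothetical per-cell measurements for one image: 5 cells x 2 features
# (say radius and texture). The real data has ten features per cell.
cells = np.array([
    [12.0, 18.0],
    [14.0, 20.0],
    [13.0, 19.0],
    [15.0, 21.0],
    [11.0, 17.0],
])

# Mean of each feature over all cells in the image
mean = cells.mean(axis=0)

# Standard error of the mean: sample standard deviation / sqrt(number of cells)
# (assumed formula, the source does not spell it out)
std_err = cells.std(axis=0, ddof=1) / np.sqrt(len(cells))

# "Worst" value: mean of the three largest values of each feature
worst = np.sort(cells, axis=0)[-3:].mean(axis=0)

# One row of the dataset: all means, then all standard errors, then all worsts
feature_vector = np.concatenate([mean, std_err, worst])
print(feature_vector)
```

With ten features instead of two, the same three statistics give the 30 values per image described above.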


Part One: Libraries Used

Part Two: Setting Up Dataframe

The breast cancer dataset doesn’t come as a DataFrame by default, and a pandas DataFrame is easier to inspect and manipulate, so I decided to convert it into one first. That is, I loaded the breast cancer data, then extracted the names of all 30 features into the ‘column’ list. After this, I created a new DataFrame and loaded the data into it.
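A minimal sketch of that conversion, assuming the data was loaded with scikit-learn's load_breast_cancer (the post does not name the loader, so that part is a guess). The `column` name mirrors the list mentioned above, and the `target` column name is made up for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Assumption: the data comes from scikit-learn's bundled copy of the dataset
data = load_breast_cancer()

# Names of the 30 features
column = list(data.feature_names)

# Build the DataFrame from the feature matrix, then attach the labels
df = pd.DataFrame(data.data, columns=column)
df["target"] = data.target  # in scikit-learn's copy: 0 = malignant, 1 = benign

print(df.shape)
print(df.head())
```

Note that scikit-learn's copy has no ID column, so the frame ends up with 30 feature columns plus the label rather than the 32 columns of the original database.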

Part Three: Train Test Split

The resulting breast cancer DataFrame obtained above was then split into training and testing sets.
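The split could look roughly like this. The split parameters are not given in the post; the accuracies reported further down are all multiples of 1/143, which matches scikit-learn's default 25% test size on 569 rows, so the sketch uses the default. The `random_state` value is an arbitrary choice for reproducibility.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load features and labels as pandas objects
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Default test_size is 0.25; random_state=0 is an assumption, not from the post
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

print(X_train.shape, X_test.shape)
```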

Part Four: Testing Algorithms

This was the fun part. In the interest of trying out something new, I used scikit-learn’s make_pipeline function. It had the added benefit of shortening the process of scaling the training and testing data (before, I had to apply the scaler to each individually).
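Here is a sketch of how the four models might be wrapped in pipelines. The hyperparameters are assumptions (library defaults, plus `max_iter=1000` and fixed `random_state` values chosen here), so the printed scores will not necessarily match the ones reported in this post.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Assumed model settings; the post does not list them
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Support Vector Classifier": SVC(),
    "K-nearest neighbors": KNeighborsClassifier(),
    "Random Forests": RandomForestClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    # The pipeline fits the scaler on the training data only, then applies
    # the same scaling to whatever is passed to score() or predict()
    pipeline = make_pipeline(StandardScaler(), model)
    pipeline.fit(X_train, y_train)
    scores[name] = pipeline.score(X_test, y_test)
    print(f"{name}: {scores[name]:.3f}")
```

The convenience is that scaling and fitting become a single object: there is no separate `fit_transform` on the training set and `transform` on the test set to keep in sync.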

Part Five: Accuracy Results

Logistic Regression: 0.958041958041958
Support Vector Classifier: 0.965034965034965
K-nearest neighbors: 0.951048951048951
Random Forests: 0.965034965034965

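It is rather interesting to see that SVC and random forest scored exactly the same on the accuracy metric. The likely reason is simply the small test set: both scores are consistent with 138 correct predictions out of 143 test samples (a 25% split of the 569 rows), so the two models got the same number of samples right, though not necessarily the same ones. Here’s the result without any standard scaling.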

Logistic Regression: 0.958041958041958
Support Vector Classifier: 0.6293706293706294
K-nearest neighbors: 0.9370629370629371
Random Forests: 0.965034965034965
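The two result lists above can be turned back into counts with a little arithmetic. This sketch assumes a test set of 143 samples (the default 25% split of 569 rows), which is inferred from the scores rather than stated in the post.

```python
# Assumption: 143 test samples (25% of 569 rows, rounded up)
n_test = 143

scaled = {
    "Logistic Regression": 0.958041958041958,
    "Support Vector Classifier": 0.965034965034965,
    "K-nearest neighbors": 0.951048951048951,
    "Random Forests": 0.965034965034965,
}
unscaled = {
    "Logistic Regression": 0.958041958041958,
    "Support Vector Classifier": 0.6293706293706294,
    "K-nearest neighbors": 0.9370629370629371,
    "Random Forests": 0.965034965034965,
}

counts = {}  # (correct with scaling, correct without scaling)
gains = {}   # accuracy gain from scaling, in percentage points
for name in scaled:
    counts[name] = (round(scaled[name] * n_test), round(unscaled[name] * n_test))
    gains[name] = (scaled[name] - unscaled[name]) * 100
    print(f"{name}: {counts[name][0]}/{n_test} scaled, "
          f"{counts[name][1]}/{n_test} unscaled, gain {gains[name]:.1f} points")
```

Under that assumption, SVC and random forest both land on 138 correct out of 143 with scaling, and the SVC gain from scaling works out to about 33.6 percentage points.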

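SVC’s heavy reliance on standard-scaled data clearly shows: scaling raises its accuracy by approximately 33.6 percentage points (from 0.629 to 0.965), putting it right alongside random forest. K-nearest neighbors also gains a little (from 0.937 to 0.951). Also interesting to see that logistic regression scored exactly the same with and without scaling. (The indifference of random forest is completely understandable, since tree splits depend only on the ordering of feature values, which scaling does not change.)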
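A hedged sketch for poking at the SVC result. The unscaled score of 0.629 is close to what a model gets by always predicting the majority class (357 of the 569 samples are benign). One possible explanation, and it is only a guess, is that the run used an older scikit-learn where SVC defaulted to `gamma="auto"`; newer versions default to `gamma="scale"`, which is less sensitive to raw feature magnitudes. The sketch tries both settings, with and without scaling. The `svc_accuracy` helper and `random_state=0` are made up for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)


def svc_accuracy(scale, gamma):
    """Test accuracy of an RBF SVC, optionally preceded by StandardScaler."""
    svc = SVC(gamma=gamma)
    model = make_pipeline(StandardScaler(), svc) if scale else svc
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)


results = {}
for gamma in ("scale", "auto"):
    for scale in (True, False):
        results[(gamma, scale)] = svc_accuracy(scale, gamma)
        label = "scaled" if scale else "unscaled"
        print(f"gamma={gamma!r}, {label}: {results[(gamma, scale)]:.3f}")
```

If the gap between scaled and unscaled is much larger for `gamma="auto"` than for `gamma="scale"`, that would support the guess above; the exact numbers depend on the split and the library version.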

As always, the full code can be found here.

What I’ve Learned

This project was on the smaller side, thanks in part to the highly meticulous research databases created by Dr. Wolberg and others like him. The major takeaway is the use of scikit-learn’s make_pipeline function, and especially how much it streamlines the fitting and testing phases. Besides that, I got to witness first-hand how standard scaling affects model performance (especially for SVC).

About Me!

An aspiring data scientist with a great interest in machine learning and its applications. I post my work here in the hope of improving over time.