This post is part of a two-part series where I delve deep into the housing prices of Boston, from the Boston Housing Dataset. (Courtesy of the U.S. Census Service and Dr. Jason)
In the previous post, I analyzed the factors present in the Boston Housing dataset and how they affected property value. In this continuation, we're going to use the same dataset, but this time train regression models to predict the target label instead.
Housing.csv – Published originally in 1978, in a paper titled 'Hedonic prices and the demand for clean air', this dataset contains data collected by the U.S. Census Service for housing in Boston, Massachusetts. [506 rows and 14 columns]
All the important technical terms in the article are highlighted in bold and are explained here.
Outliers: (Statistics) An outlier is an observation that lies an abnormal distance from other values in a random sample from a population.
Log Transformation – (Data Science) A data transformation method that replaces each value x with log(x). Log transformation reduces or removes the skewness of the original data.
Part one: Libraries Used
Part Two: Setting Up Dataframe
This was done largely the same way as in my previous post.
Just like last time, I ran the data through an outlier search.
Again, the output was:
This time, however, I removed the outliers present in the target label column, and also dropped feature columns with a considerably high outlier percentage from the analysis.
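My exact outlier criterion isn't reproduced here, but a common choice (and one consistent with the "abnormal distance" definition above) is Tukey's 1.5 × IQR rule. Below is a sketch on toy data standing in for Housing.csv; the column names (RM, CRIM, MEDV) are real Boston columns, but the values are synthetic:

```python
import numpy as np
import pandas as pd

def outlier_percentage(col: pd.Series) -> float:
    """Share of values outside 1.5 * IQR of the quartiles (Tukey's rule)."""
    q1, q3 = col.quantile(0.25), col.quantile(0.75)
    iqr = q3 - q1
    return ((col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)).mean() * 100

# Toy stand-in for Housing.csv: real column names, fake values.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "RM": rng.normal(6.3, 0.7, 500),    # roughly symmetric feature
    "CRIM": rng.exponential(3.6, 500),  # heavily right-skewed feature
    "MEDV": np.append(rng.normal(22, 7, 495), [150, 160, 155, 170, 180]),
})

pct = df.apply(outlier_percentage)  # outlier percentage per column
print(pct.round(1))

# Remove only the rows that are outliers in the target column (MEDV).
q1, q3 = df["MEDV"].quantile([0.25, 0.75])
iqr = q3 - q1
df_clean = df[df["MEDV"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
```

A feature column whose percentage comes back considerably high would then be dropped with `df.drop(columns=[...])`.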
Log transformation: without going into too much detail, I used log transformation (specifically np.log1p, which takes the natural log of x + 1) to normalize the distribution of the target label and remove skewness from the data before feeding it to the training models, which helps improve accuracy.
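As a quick illustration of why np.log1p is convenient: it compresses the long right tail of a skewed target, and np.expm1 is its exact inverse, so predictions can be mapped back to the original price scale. The values below are made up for demonstration:

```python
import numpy as np

def sample_skew(a: np.ndarray) -> float:
    """Simple (biased) sample skewness: the third standardized moment."""
    d = a - a.mean()
    return (d ** 3).mean() / a.std() ** 3

# Made-up, right-skewed target values (house prices in $1000s).
y = np.array([12.0, 18.5, 21.0, 22.5, 24.0, 27.0, 33.0, 50.0])

y_log = np.log1p(y)       # log(1 + y): compresses the long right tail
y_back = np.expm1(y_log)  # exact inverse, recovers the original scale

print(round(sample_skew(y), 2), round(sample_skew(y_log), 2))
print(np.allclose(y, y_back))  # True
```

The skewness of the transformed values is much closer to zero, which is the "normalizing the curve" effect described above.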
Part Three: Train Test Split
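The split itself is a one-liner with scikit-learn. The 80/20 ratio and random_state below are assumptions for illustration, not necessarily the settings I used:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays: 506 samples (like the Boston data) with 10 features.
rng = np.random.default_rng(42)
X = rng.normal(size=(506, 10))
y = rng.normal(loc=3.0, size=506)

# Hold out 20% for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (404, 10) (102, 10)
```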
Part Four: Testing Algorithms
I chose four training models for this task: Support Vector Regression, Gradient Boosting Regression, Decision Tree Regression, and Polynomial Linear Regression. I tried these because I hadn't used most of them before (this is my first time doing regression) and as such was not aware of all the alternatives available.
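As a sketch, each of the four models can be wrapped in make_pipeline() so that scaling is learned from the training data only. The hyperparameters here are library defaults, not the tuned settings mentioned later:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.svm import SVR
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

# Default hyperparameters throughout; the pipeline structure is the point here.
models = {
    "Support Vector Regression": make_pipeline(StandardScaler(), SVR()),
    "Gradient Boosting Regression": make_pipeline(
        StandardScaler(), GradientBoostingRegressor(random_state=0)
    ),
    "Decision Tree Regression": make_pipeline(
        StandardScaler(), DecisionTreeRegressor(random_state=0)
    ),
    "Polynomial Linear Regression": make_pipeline(
        PolynomialFeatures(degree=2), StandardScaler(), LinearRegression()
    ),
}

# Smoke-test each pipeline on random data.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)
for name, pipe in models.items():
    pipe.fit(X, y)
    print(name, pipe.predict(X[:1]).round(2))
```

Note that "Polynomial Linear Regression" is just ordinary linear regression on polynomial-expanded features, which is why PolynomialFeatures sits at the front of that pipeline.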
Part Five: Accuracy Results
Gradient Boosting Regression scored the highest, with Support Vector Regression coming in last. Removing outliers, applying the log transformation, and normalizing the training data (StandardScaler) boosted the score of every model (especially Gradient Boosting) by 4-6%. Though the final scores leave a lot to be desired, I at least ended up above the 85% mark, so I'm going to leave it there for now. Of course, if I come across more ways of increasing accuracy, I will update the post accordingly.
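For context, the "score" for a scikit-learn regressor is R², as returned by a pipeline's .score() method on held-out data. The data below is synthetic, so the printed number is only an illustration, not one of the actual results:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the prepared Boston features.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
y = X @ np.array([3.0, -2.0, 1.5, 0.5, 2.0]) + rng.normal(scale=0.5, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

model = make_pipeline(StandardScaler(), GradientBoostingRegressor(random_state=1))
model.fit(X_tr, y_tr)

# For regressors, .score() returns R^2 -- the figure quoted as a percentage above.
r2 = model.score(X_te, y_te)
print(round(r2, 3))
```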
As always, the full code can be found on my GitHub page.
WHAT I’VE LEARNED
Getting around to training regression models was a nice change from my previously predominantly classification-focused work. I learned about four new regression models (and their optimized settings) and integrated them into my familiar make_pipeline() workflow. It was also interesting to see how regression outcomes are affected by changing the scaling of the dataset (i.e., switching between StandardScaler() and MinMaxScaler()).
The most important new achievement was learning two new tricks for optimizing data before feeding it to a model: outlier percentage filtering and logarithmic transformation. The latter improved accuracy by quite a bit, although I'm still quite hazy on the mathematics behind it.
An aspiring data scientist with a great interest in machine learning and its applications. I post my work here in the hope of improving over time.