This post is part of a two-part series in which I delve deep into the housing prices of Boston, using the Boston Housing Dataset. (Courtesy of the U.S. Census Service and Dr. Jason)
In an attempt to continue my interest in studying and analyzing the housing market and the various factors that go into its functioning, I picked up the housing market data available on Kaggle.
CRIM: Per capita crime rate by town
ZN: Proportion of residential land zoned for lots over 25,000 sq. ft
INDUS: Proportion of non-retail business acres per town
CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX: Nitric oxide concentration (parts per 10 million)
RM: Average number of rooms per dwelling
AGE: Proportion of owner-occupied units built prior to 1940
DIS: Weighted distances to five Boston employment centers
RAD: Index of accessibility to radial highways
TAX: Full-value property tax rate per $10,000
PTRATIO: Pupil-teacher ratio by town
B: 1000(Bk - 0.63)², where Bk is the proportion of [people of African American descent] by town
LSTAT: Percentage of the lower status of the population
MEDV: Median value of owner-occupied homes in $1000s
I ended up changing the name MEDV (Median Property Value) to ‘Target’ in order to simplify the graphs in the analysis that follows.
This “quirky” array of features proves to be rather interesting, as the data was first collected with the aim of evaluating whether cleaner air had any noticeable effect on the market price of a house.
I, on the other hand, am interested in all of the recorded factors and how they affected the housing price.
Housing.csv: Published originally in 1978, in a paper titled ‘Hedonic prices and the demand for clean air’, this data set contains the data collected by the U.S. Census Service for housing in Boston, Massachusetts. [506 Rows and 14 Columns]
All the important technical terms in the article are highlighted in bold and are explained here.
HEATMAP: A heatmap of a correlation matrix is a column-by-column grid that visually represents how strongly each pair of variables is linearly related. Strong positive correlations score close to 1.0, strong negative correlations score close to -1.0, and values near 0 indicate little linear relationship.
Normal Distribution: (Statistics) A normal distribution is a symmetrical, bell-shaped distribution defined by its mean and standard deviation. It has zero skew and a kurtosis of 3. In the standard normal distribution, the mean is zero and the standard deviation is 1.
Multinomial Distribution: (Statistics) The multinomial distribution is a probability distribution that describes the outcomes of repeated experiments in which each trial can end in one of two or more categories. It generalizes the binomial distribution.
Outliers: (Statistics) An outlier is an observation that lies an abnormal distance from other values in a random sample from a population.
Industrial Area: High proportion of non-retail business area per town as defined by the original research.
Student towns: Towns or communities where the majority of the local population consists of students/pupils. Also known as “university towns”.
PART ONE: Libraries Used
PART TWO: Setting Up Dataframe
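A minimal sketch of the loading step, assuming the Kaggle file is comma-separated with a header row (if your copy is laid out differently, the `read_csv` arguments need adjusting). The helper name `load_housing` is my own placeholder, and the tiny in-memory sample only exists so the snippet runs without the real file.

```python
import io

import pandas as pd


def load_housing(source):
    """Read the housing data and rename MEDV to Target.

    `source` can be a file path or a file-like object.
    Assumption: comma-separated values with a header row.
    """
    df = pd.read_csv(source)
    # MEDV is the median home value; the rest of the post calls it "Target".
    return df.rename(columns={"MEDV": "Target"})


# Tiny in-memory stand-in (made-up layout, three columns only).
sample_csv = io.StringIO("CRIM,RM,MEDV\n0.00632,6.575,24.0\n0.02731,6.421,21.6\n")
sample = load_housing(sample_csv)
print(sample.columns.tolist())
```

With the real file, `df = load_housing("Housing.csv")` followed by `df.shape` should report 506 rows and 14 columns.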
Although missing values were unlikely, I did run the data through a notnull() check just to rule out any unwanted surprises.
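A check along these lines could look like the sketch below. The function name `missing_report` and the demo frame are hypothetical; only `notnull()` comes from the text above.

```python
import numpy as np
import pandas as pd


def missing_report(df):
    """Return the number of missing values per column (0 means no gaps)."""
    # notnull() marks present cells as True; inverting and summing counts the gaps.
    return (~df.notnull()).sum()


# Made-up demo frame with one deliberate gap in RM.
demo = pd.DataFrame({"RM": [6.5, np.nan, 7.1], "TAX": [296, 242, 242]})
report = missing_report(demo)
print(report)
```

On the real dataframe, `missing_report(df).sum() == 0` would confirm that every cell is populated.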
So that was a relief. The next part is dedicated to analyzing the data for valuable insights and patterns.
PART THREE: Data Analysis
First, I wanted to get a general idea of how the data was distributed across the various fields, so I pulled up the box plots and then the distribution curves.
Upon first glance, it’s pretty clear that some features have a lot more outliers than others (CRIM, ZN, CHAS, RM, B, Target), with CHAS particularly standing out as having 100% outliers. I ran an outlier confirmation script, and the result was affirmative.
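The original confirmation script is not shown here, so the following is only a sketch of one common approach: the 1.5 × IQR rule, which is also how box-plot whiskers are usually drawn. The function name and the demo values are assumptions of mine.

```python
import pandas as pd


def iqr_outlier_share(df, k=1.5):
    """Percentage of rows per column outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # The comparisons align on column names and give a boolean frame.
    flagged = (df < lower) | (df > upper)
    return flagged.mean() * 100


# Made-up values: a mostly-zero dummy column and a column with one large value.
demo = pd.DataFrame({
    "CHAS": [0, 0, 0, 0, 0, 0, 0, 1],
    "RM": [6.0, 6.1, 6.2, 6.3, 6.4, 6.5, 6.6, 9.9],
})
shares = iqr_outlier_share(demo)
print(shares)
```

One thing this rule makes visible: for a binary column that is mostly zeros, Q1 and Q3 are both 0, so the IQR is 0 and every nonzero value gets flagged. That is likely why a dummy variable like CHAS looks so odd on a box plot.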
Let’s check out the distribution curves for a better understanding of what’s going on.
RM is the only feature with a near-perfect normal distribution curve, while INDUS, RAD, and TAX feature a multinomial distribution. (Note: I mean that their plots resemble the standard distribution curves, not that they actually follow such distributions.)
As inferred previously, CRIM, ZN, RM, B, and CHAS consist mainly of outliers and have heavily skewed distributions. Interestingly, Target managed to avoid being completely skewed, perhaps because it is correlated with all the other factors.
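To put a number on “heavily skewed”, one option is the sample skewness that pandas provides. This is a sketch on made-up columns, not the figures from the actual dataset.

```python
import pandas as pd


def skew_table(df):
    """Sample skewness per column, sorted in ascending order."""
    # Values near 0 suggest symmetry; large positive values mean a long right tail.
    return df.skew().sort_values()


# Made-up columns: one symmetric, one with a long right tail.
demo = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5, 6, 7],
    "right_tail": [1, 1, 1, 2, 2, 3, 20],
})
skews = skew_table(demo)
print(skews)
```

Running `skew_table(df)` on the housing dataframe would give a quick ranking to compare against the eyeballed curves.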
Next up, a heatmap for correlation.
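A heatmap like this can be drawn with seaborn roughly as follows. The styling choices (color map, annotations, figure size) and the three-column demo frame are assumptions for illustration, not the original plot settings.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_corr_heatmap(df):
    """Draw a Pearson correlation heatmap and return the matrix and the axes."""
    corr = df.corr()
    fig, ax = plt.subplots(figsize=(10, 8))
    # Fix the color scale to [-1, 1] so colors are comparable between plots.
    sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm",
                vmin=-1, vmax=1, ax=ax)
    ax.set_title("Feature correlation heatmap")
    return corr, ax


# Made-up values, only to make the snippet runnable on its own.
demo = pd.DataFrame({
    "RM": [5.0, 6.0, 7.0, 8.0],
    "LSTAT": [30.0, 18.0, 9.0, 4.0],
    "Target": [12.0, 20.0, 31.0, 45.0],
})
corr, ax = plot_corr_heatmap(demo)
```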
Focusing on the last column on the far right, it’s easy to see that a majority of columns actually have a negative correlation with the target house value (the only exceptions being B, DIS, RM, and ZN). This does not mean, however, that the rest are useless. A strong negative correlation can be just as useful for building regression models as a positive one.
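Reading that last column can also be done in code. The sketch below pulls out each feature's correlation with the target and sorts it, so the sign and strength are easy to scan. The function name and demo values are hypothetical.

```python
import pandas as pd


def target_correlations(df, target="Target"):
    """Pearson correlation of every other column with the target, ascending."""
    corr = df.corr()[target].drop(target)
    return corr.sort_values()


# Made-up values for illustration only.
demo = pd.DataFrame({
    "RM": [5.0, 6.0, 7.0, 8.0],
    "LSTAT": [30.0, 18.0, 9.0, 4.0],
    "Target": [12.0, 20.0, 31.0, 45.0],
})
ranked = target_correlations(demo)
# The feature with the largest absolute correlation, whatever its sign.
strongest = ranked.abs().idxmax()
print(ranked)
```

Sorting by `ranked.abs()` instead treats strong negative and strong positive correlations alike, which matches the point that both are useful for regression.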
Lastly, correlation curves of the features.
I excluded CHAS from the graphs above because most of its data points were out of range. (And it had no meaningful correlation.)
Looking at the plots, it’s pretty clear that, out of the selected features, only a few show a clear positive correlation with housing prices. One is RM, the number of rooms per house, which makes perfect sense (a 3BHK house will almost certainly be more expensive than a 1BHK one). ZN and B also have a positive correlation with housing prices. (I’ll discuss them later.)
What was more interesting, however, was the inverse relation with TAX (full-value property tax rate per $10,000) – this would imply that the highly assessed properties on the market have a lower tax levied on them (and vice versa). Although not completely perfect, the correlation does seem to hold up. The peculiar entries of low valued but highly taxed properties on the right extreme do seem rather odd though. Are they just outliers? Only further probing might tell.
Other useful insights include:
Houses in industrial areas seem to be cheaper. (INDUS)
Residents around these areas are likely factory workers, so the average income is probably on the lower end. Well-off buyers are also likely to avoid properties around the noise and pollution of factories.
Student towns tend to have cheaper housing prices. (PTRATIO)
This conclusion is rather intuitive: students do not need lavish four-bedroom penthouses, so the market in these special economic zones has small, cheap apartments and housing. Of course, this correlation is the shakiest of all, likely due to the prices of normal houses.
Areas with a higher concentration of people of lower status tend to have lower housing prices. (LSTAT)
No surprises here, just the invisible hand of the capitalist market at work. It is important to point out that the relationship is most likely second-order (quadratic) rather than linear, owing to the recognizable shape of the graph.
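One way to test the “order two” hunch is to fit both a straight line and a quadratic and compare how much variance each explains. The sketch below uses synthetic numbers that merely mimic a curved, downward-sloping relationship; they are not the real LSTAT and Target values, and the helper `r_squared` is my own.

```python
import numpy as np


def r_squared(x, y, degree):
    """R^2 of a least-squares polynomial fit of the given degree."""
    coeffs = np.polyfit(x, y, degree)
    fitted = np.polyval(coeffs, x)
    ss_res = np.sum((y - fitted) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot


# Synthetic stand-in data (assumption): a curved trend plus a little noise.
rng = np.random.default_rng(0)
lstat = np.linspace(2, 38, 60)
price = 50 - 2.2 * lstat + 0.035 * lstat**2 + rng.normal(0, 1.0, lstat.size)

r2_linear = r_squared(lstat, price, 1)
r2_quadratic = r_squared(lstat, price, 2)
print(f"linear R^2: {r2_linear:.3f}, quadratic R^2: {r2_quadratic:.3f}")
```

A quadratic fit can never score a lower R² than a linear one on the same data, so what matters is the size of the gap. A clear jump would support the second-order reading, while a tiny one would not.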
Regions with higher nitric oxide concentrations suffered lower average property values. (NOX)
This inference does align with the initial hypothesis the original research started with, that is, that air quality does affect the prices of houses in an area. Of course, polluted air does not affect them in a positive way.
Older houses tend to cost less. (AGE)
The term “older” in this context means owner-occupied houses that were built prior to 1940 (the data was collected during the 1970s), so roughly 30 years old or more. While there is a noticeable downward trend, there’s also a good number of outliers that seem impervious to it. Perhaps these are the more lavish bungalows built by the nobility of the time, which are still maintained with great diligence and therefore retain their higher market value. After all, a lack of proper maintenance is one of the major factors contributing to the depreciation of house prices, which nicely explains the vast majority of the other data points on the graph.
Larger residential zones have properties of higher value. (ZN)
Residential zones typically include co-op and rental apartments, condominiums (a.k.a. condos), mobile home parks, and single-family homes. Some residential zones may also allow for home-based businesses. Therefore, it makes sense for larger residential plots to consist of properties of proportionally high value. The cluster of data points at the left extreme does strike me as odd, though; as of writing, I’m unable to explain it.
Houses farther away from major employment centers cost more. (DIS)
A bit of an oddball for sure: one would think that people would pay huge premiums to own houses close to their workplace. However, realizing that this was about the time when cars and transport infrastructure in America were around their peak helps explain why the suburban housing market boomed. As more and more Americans retreated from the hustle and bustle of the city to the peace and tranquility of the countryside, these people (usually the upper working class) were more willing to shell out a premium for this luxury.
As always, the entire code can be found on my GitHub page.
WHAT I’VE LEARNED
Finally, I was able to use some of my plotting skills, something I dearly missed in my last data analysis. It’s by no means perfect, but hey, it’s a start. Using the seaborn library to plot heatmaps and customize their visual appearance was also a skill I picked up during this project, and so was successfully running not-null and outlier checks (although I will fully utilize the latter in the 2nd part of this post).
I also learned a lot about gauging box-plots, distribution curves, and inference and a whole lot about 70s Boston housing markets.
In the next part of the series, I’ll be going over the machine learning side of the fence (a.k.a. predicting housing prices using regression models).
An aspiring data scientist with a great interest in machine learning and its applications. I post my work here in the hope to improve over time.