
The following is a part of my machine learning project log completed during the applied machine learning introduction course from Coursera.
PREMISE
The Michigan Data Science Team (MDST) and the Michigan Student Symposium for Interdisciplinary Statistical Sciences (MSSISS) have partnered with the City of Detroit to help solve one of the most pressing problems facing Detroit – blight.
Blight violations are issued by the city to individuals who allow their properties to remain in a deteriorated condition.
– City Of Detroit
Every year, the city of Detroit issues millions of dollars in fines to residents, and every year, many of these fines remain unpaid. Enforcing unpaid blight fines is a costly and tedious process, so the city wants to know: how can we increase blight ticket compliance?
GOAL
Develop a machine learning model that can predict the likelihood of blight ticket compliance.
Goal
Terminology Used
Blighted Property: A property that is left untended for many years which leads to its eventual deterioration. Usually because either the owners have passed away, or are in no financial condition to afford the renovation.
Violator: The individual, or business, to which, the violation charges are issued. Are usually the owner or inheritor of the property.
Ticket_id: The unique id number that is issued to each blight violator. – Primarily used as a unique index value.
Lat and lon: The latitude and longitude coordinates of the blighted property. Can be used to pinpoint said property on a geographical map.
Compliance: The outcome in which the resident to which the blight ticket was issued to, complies with the authorities and end up paying all due fines and fees.
Data Sources Used
Train.csv: Provided by Detroit Open Data Portal, contains rows of blight tickets and information about when, why, and to whom each ticket was issued from 2004 to 2011. As the name suggests, I used it as training data.
Test.csv, Provided by Detroit Open Data Portal, contains rows of blight tickets and preliminary information about when, why, and to whom each ticket was issued from 2012 to 2016. This will serve as the test data which I aim to make predictions on.
Addresses.csv, Contains the addresses of each blighted property indexed by ticket id.
Latlon.csv, Contains the latitude and longitude of the address of each blighted property indexed by address.
Methodology
Part 1: Importing Libraries
Part 2: Reading Data,
The first part was loading all the relevant data for the training phase. this included.
- Train.csv
- Addresses.csv
- Latlon.csv
Since some court hearings ultimately resulted in a ‘not-responsible’ outcome, parts of the data have null values in its label column.
Further, since most data is in form of personal statements and details, padding null values were out of the equation so applied a simple not null mask.
Lastly, since address_data and latlons_data included important info about the exact geographic location of each violation ticket (in numeric-form no less), they were joined to the training and testing dataset.
Note [30/11/2020]
I now realize that ISO-8859-1 and UTF-8 both encode ASCII in exactly the same way. (but UTF-8 is more flexible in that it supports languages with more than 128 symbols) – This realization, however, will not affect the results of this post in any way. I just wanted to put up a clarification for future reference.
Part 3: Postprocessing.
Removing features from the training dataset that are absent in the testing dataset to prevent overfitting. Next, removing columns that contain string entries as labeling every single one of them would prove to be too resource-intensive and will provide little improvements. (Most of them are personal details that can’t be generalized effectively.) Lastly, padding some entries of lat and lon columns that are filled with NaN. (perhaps due to mismatch caused by spelling errors).
Part 4: Training And Testing dataset.
First, Derived X_train and y_train dataset from the training data. No need to use train_test_split() this time as separate datasets were provided as part of the project. The datasets were, however, scaled using MinMaxScaler (which basically caps all feature entries between 0 to 1) for better performance as we are going to run them several times through several prediction models.
Part 5: Fitting Models.
This was the final part, here, with the data all sorted and trimmed for analysis. I ran it through various classification model configurations. such as,
- Support Vector Classification.
- Linear Regression.
- K-Nearest neighbor.
- MLPClassifier (Neural Network).
Obviously, I didn’t go through every single classification model but only the ones that I had a general hunch would be a good fit (both in accuracy and computation time). Here are the scores,
Support Vector Classification
(kernel = ‘rbf, gamma = 0.01, c =0.1)
ACCURACY: 0.76
LogisticRegression
(default)
ACCURACY: 0.71
K-Nearest Neighbors
(n_neighbor = 5)
ACCURACY: 0.65
MLP Classifier
(hidden_layer_sizes=[100,10],alpha=0.001,random_state=0,solver=’lbfgs’,verbose=0)
ACCURACY: 0.96 (Best)
Part 6: Returning Predictions
The code in its entirety can be found here,
What I’ve Learned
As mentioned at the start, this project was a part of the Machine Learning With Python Course at Coursera. Overall, it was a pretty nice learning experience. Being my first major exposure to the field,
Through the work of the final project, I learned how to split data for cross-validation, various types of prediction models, – their strengths and weaknesses as well as how to tweak them for maximum value. I also learned how to deal with moderately large datafiles and sort out only the most useful features.
The later of which taught me the importance of using ISO-8859-1 (or Latin-1) encoding (especially for large files), the amount of time it shaved off as compared to the standard python engine was immense.
This analysis was rather short since y_test labels weren’t provided I wasn’t able to test precision, recall, or the like. For the future, I would like to also start posting ROC_curves. I still have some time before I actually understand how to get it to work effectively.

About Me!
An aspiring data scientist with a great interest in machine learning and its applications. I post my work here in the hope to improve over time.