An interesting neighborhood comparison metric is the strength of the schools in each area. For this evaluation, ACT scores were obtained from the California Department of Education website, which lists each school's name and average ACT score for four sections: Reading, English, Math, and Science. This raw data had to be processed and cleaned to extract useful features, using the following workflow (sketched below): first, use the Google Maps API to geocode each school name to a geolocation; second, remove any record whose geolocation could not be resolved by the API; third, aggregate the records by neighborhood. Unlike the other two datasets (Crime and Housing), the ACT dataset contains only 211 records, since there are roughly 200 high schools in LA.
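A minimal sketch of this workflow, assuming the googlemaps Python client; the file name, the column names (School, Reading, English, Math, Science), the API key, and the use of reverse geocoding to resolve the neighborhood are all illustrative assumptions, not details taken from the original files:

```python
import googlemaps
import pandas as pd

gmaps = googlemaps.Client(key="YOUR_API_KEY")  # placeholder API key
act = pd.read_csv("act_scores.csv")            # assumed file and column names

def school_neighborhood(name):
    """Geocode a school name, then reverse-geocode the point to find
    its neighborhood; return None if either lookup fails."""
    hits = gmaps.geocode(f"{name}, Los Angeles, CA")
    if not hits:
        return None
    loc = hits[0]["geometry"]["location"]
    for result in gmaps.reverse_geocode((loc["lat"], loc["lng"])):
        for comp in result["address_components"]:
            if "neighborhood" in comp["types"]:
                return comp["long_name"]
    return None

# Steps 1-2: resolve each school to a neighborhood, dropping failures.
act["Neighborhood"] = act["School"].apply(school_neighborhood)
act = act.dropna(subset=["Neighborhood"])

# Step 3: aggregate the ACT section averages per neighborhood.
act_by_hood = act.groupby("Neighborhood")[
    ["Reading", "English", "Math", "Science"]
].mean()
```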
The pipeline for getting the crime data into MySQL consisted of the following steps. First, the dataset “Crime_Data_from_2010_to_2019.csv” was downloaded from https://data.lacity.org/A-Safe-City/Crime-Data-from-2010-to-2019/63jg-8b9z. The full dataset contained about 2 million records covering the years 2010-2019; using all of them proved too burdensome, so only records for the year 2019 were kept, totalling about 200k records. Since those records already contained latitude and longitude, the second step consisted of using the Google Maps API reverse geocode method to obtain the neighborhood in which each crime happened. Unfortunately, the API's usage limits stopped the queries after around 100k had been sent.
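A minimal sketch of these first two steps, again assuming the googlemaps Python client; the column names (“DATE OCC”, “LAT”, “LON”) and the API key are assumptions, not verified against the raw file:

```python
import googlemaps
import pandas as pd
from googlemaps.exceptions import ApiError

gmaps = googlemaps.Client(key="YOUR_API_KEY")  # placeholder API key
crime = pd.read_csv("Crime_Data_from_2010_to_2019.csv")

# Step 1: keep only 2019 records ("DATE OCC" as the date column is assumed).
crime["DATE OCC"] = pd.to_datetime(crime["DATE OCC"])
crime = crime[crime["DATE OCC"].dt.year == 2019]

def neighborhood_of(lat, lon):
    """Reverse-geocode a point and extract its neighborhood, or None."""
    try:
        results = gmaps.reverse_geocode((lat, lon))
    except ApiError:
        return None  # e.g. quota exhausted
    for result in results:
        for comp in result["address_components"]:
            if "neighborhood" in comp["types"]:
                return comp["long_name"]
    return None

# Step 2: look up the neighborhood of each crime location.
crime["Neighborhood"] = [
    neighborhood_of(lat, lon)
    for lat, lon in zip(crime["LAT"], crime["LON"])
]
```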
This forced the team to resort to a KNN classifier to predict the neighborhoods for the rest of the data, which was step three. A train-test split showed this approach to be highly effective: neighborhoods are by definition contiguous, non-overlapping regions in a low-dimensional (two-dimensional) space, about as ideal a setting for a KNN algorithm as one could hope for, and the classifier reached an accuracy of 97%. The KNN algorithm's excellent results can be verified in the image below, which visually compares the neighborhoods obtained from the Google Maps API (about 100k samples, all of which were used to train the algorithm) with the neighborhoods predicted by KNN for the remaining ~100k samples.
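Step three maps directly onto scikit-learn; a minimal sketch, assuming the partially labeled crime DataFrame from the previous step and an illustrative k of 5:

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

labeled = crime[crime["Neighborhood"].notna()]   # ~100k rows the API labeled
unlabeled = crime[crime["Neighborhood"].isna()]  # ~100k rows left to predict

X = labeled[["LAT", "LON"]].values
y = labeled["Neighborhood"].values

# Hold out a test set to estimate accuracy (the project reports 97%).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5)  # k=5 is an assumed hyperparameter
knn.fit(X_tr, y_tr)
print("test accuracy:", accuracy_score(y_te, knn.predict(X_te)))

# Refit on all labeled points and fill in the missing neighborhoods.
knn.fit(X, y)
crime.loc[unlabeled.index, "Neighborhood"] = knn.predict(
    unlabeled[["LAT", "LON"]].values
)
```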
Step four was the creation of a feature called “Crime_Weighted_Norm”, obtained by converting the “Crime Code” associated with each record. “Crime Code” indicates the crime committed, with lower code numbers representing more serious offenses; the full Crime Code encoding is available at https://data.lacity.org/api/views/63jg-8b9z/files/fff2caac-94b0-4ae5-9ca5-d235b19e3c44?download=true&filename=UCR-COMPSTAT062618.pdf. “Crime_Weighted_Norm” was obtained by applying two transformations to the crime code of each record. First, the scale was inverted so that a higher number signified a more serious crime. This was achieved by dividing 100 (the lowest, and thus most serious, crime code) by the crime code of each data point: since more serious crimes have lower codes, those instances have a smaller divisor and therefore yield a larger value (for example, a code of 110 maps to 100/110 ≈ 0.91, while a code of 900 maps to 100/900 ≈ 0.11). The inversion was applied primarily for interpretability, as a high score on a crime metric is more naturally read as a worse situation. Second, the inverted values were normalized into the [0, 1] range, which further simplifies interpretation of the data. The other columns of the original dataset amounted essentially to the full public police log for each crime; they contained no additional information relevant to this project and were discarded.

Lastly, step five was storing the final dataset in MySQL.
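A sketch of steps four and five under stated assumptions: the column name “Crime Code”, min-max scaling as the normalization method, the list of kept columns, and the MySQL credentials are all illustrative:

```python
from sqlalchemy import create_engine

# Step 4: invert the code scale (100 / code) so higher means more serious,
# then min-max normalize into [0, 1].
inverted = 100.0 / crime["Crime Code"]
crime["Crime_Weighted_Norm"] = (inverted - inverted.min()) / (
    inverted.max() - inverted.min()
)

# Drop the police-log columns, keeping only what the project needs
# (the exact set of kept columns here is illustrative).
crime = crime[["Neighborhood", "Crime_Weighted_Norm"]]

# Step 5: store the final dataset in MySQL (placeholder credentials).
engine = create_engine("mysql+pymysql://user:password@localhost/la_project")
crime.to_sql("crime", engine, if_exists="replace", index=False)
```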