An interesting neighborhood comparison metric is the strength of the schools in each area. For this evaluation, ACT scores were obtained from the California Department of Education website, which lists each school's name and average ACT score for four sections: Reading, English, Math, and Science. This raw data had to be processed and cleaned to extract useful features, using the following workflow (sketched below): first, use the Google Maps API to geocode each school name to a geolocation; second, remove any record whose geolocation could not be resolved by the API; third, aggregate the records by neighborhood. Unlike the other two datasets (Crime and Housing), the ACT dataset contains only 211 records, since there are roughly 200 high schools in LA.
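A minimal sketch of this workflow, assuming the googlemaps Python client; the file name, the column names (School, Reading, English, Math, Science), the API key, and the use of reverse geocoding to resolve the neighborhood are all illustrative assumptions, not details taken from the original files:

```python
import googlemaps
import pandas as pd

gmaps = googlemaps.Client(key="YOUR_API_KEY")  # placeholder API key
act = pd.read_csv("act_scores.csv")            # assumed file and column names

def school_neighborhood(name):
    """Geocode a school name, then reverse-geocode the point to find
    its neighborhood; return None if either lookup fails."""
    hits = gmaps.geocode(f"{name}, Los Angeles, CA")
    if not hits:
        return None
    loc = hits[0]["geometry"]["location"]
    for result in gmaps.reverse_geocode((loc["lat"], loc["lng"])):
        for comp in result["address_components"]:
            if "neighborhood" in comp["types"]:
                return comp["long_name"]
    return None

# Steps 1-2: resolve each school to a neighborhood, dropping failures.
act["Neighborhood"] = act["School"].apply(school_neighborhood)
act = act.dropna(subset=["Neighborhood"])

# Step 3: aggregate the ACT section averages per neighborhood.
act_by_hood = act.groupby("Neighborhood")[
    ["Reading", "English", "Math", "Science"]
].mean()
```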
The pipeline for getting the crime data into MySQL consisted of the following steps. First, the dataset “Crime_Data_from_2010_to_2019.csv” was downloaded from https://data.lacity.org/A-Safe-City/Crime-Data-from-2010-to-2019/63jg-8b9z. The full dataset contained about 2 million records covering the years 2010-2019; using all of them proved too burdensome, so only records for the year 2019 were kept, totalling about 200k records. Since those records already contained latitude and longitude, the second step consisted of using the Google Maps API reverse geocode method to obtain the neighborhood in which each crime happened. Unfortunately, the API's usage limits stopped the queries after around 100k had been sent.
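A minimal sketch of these first two steps, again assuming the googlemaps Python client; the column names (“DATE OCC”, “LAT”, “LON”) and the API key are assumptions, not verified against the raw file:

```python
import googlemaps
import pandas as pd
from googlemaps.exceptions import ApiError

gmaps = googlemaps.Client(key="YOUR_API_KEY")  # placeholder API key
crime = pd.read_csv("Crime_Data_from_2010_to_2019.csv")

# Step 1: keep only 2019 records ("DATE OCC" as the date column is assumed).
crime["DATE OCC"] = pd.to_datetime(crime["DATE OCC"])
crime = crime[crime["DATE OCC"].dt.year == 2019]

def neighborhood_of(lat, lon):
    """Reverse-geocode a point and extract its neighborhood, or None."""
    try:
        results = gmaps.reverse_geocode((lat, lon))
    except ApiError:
        return None  # e.g. quota exhausted
    for result in results:
        for comp in result["address_components"]:
            if "neighborhood" in comp["types"]:
                return comp["long_name"]
    return None

# Step 2: look up the neighborhood of each crime location.
crime["Neighborhood"] = [
    neighborhood_of(lat, lon)
    for lat, lon in zip(crime["LAT"], crime["LON"])
]
```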
This forced the team to resort to a KNN classifier to predict the neighborhoods for the rest of the data, which was step three. A train-test split showed this approach to be highly effective: neighborhoods are by definition contiguous, non-overlapping regions in a low-dimensional (two-dimensional) space, about as ideal a setting for a KNN algorithm as one could hope for, and the classifier reached an accuracy of 97%. The KNN algorithm's excellent results can be verified in the image below, which visually compares the neighborhoods obtained from the Google Maps API (about 100k samples, all of which were used to train the algorithm) with the neighborhoods predicted by KNN for the remaining ~100k samples.
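Step three maps directly onto scikit-learn; a minimal sketch, assuming the partially labeled crime DataFrame from the previous step and an illustrative k of 5:

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

labeled = crime[crime["Neighborhood"].notna()]   # ~100k rows the API labeled
unlabeled = crime[crime["Neighborhood"].isna()]  # ~100k rows left to predict

X = labeled[["LAT", "LON"]].values
y = labeled["Neighborhood"].values

# Hold out a test set to estimate accuracy (the project reports 97%).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5)  # k=5 is an assumed hyperparameter
knn.fit(X_tr, y_tr)
print("test accuracy:", accuracy_score(y_te, knn.predict(X_te)))

# Refit on all labeled points and fill in the missing neighborhoods.
knn.fit(X, y)
crime.loc[unlabeled.index, "Neighborhood"] = knn.predict(
    unlabeled[["LAT", "LON"]].values
)
```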
Step four was the creation of a feature called “Crime_Weighted_Norm”, obtained by converting the “Crime Code” associated with each record. “Crime Code” indicates the crime committed, with lower code numbers representing more serious offenses; the full Crime Code encoding is available at https://data.lacity.org/api/views/63jg-8b9z/files/fff2caac-94b0-4ae5-9ca5-d235b19e3c44?download=true&filename=UCR-COMPSTAT062618.pdf. “Crime_Weighted_Norm” was obtained by applying two transformations to the crime code of each record. First, the scale was inverted so that a higher number signified a more serious crime. This was achieved by dividing 100 (the lowest, and thus most serious, crime code) by the crime code of each data point: since more serious crimes have lower codes, those instances have a smaller divisor and therefore yield a larger value (for example, a code of 110 maps to 100/110 ≈ 0.91, while a code of 900 maps to 100/900 ≈ 0.11). The inversion was applied primarily for interpretability, as a high score on a crime metric is more naturally read as a worse situation. Second, the inverted values were normalized into the [0, 1] range, which further simplifies interpretation of the data. The other columns of the original dataset amounted essentially to the full public police log for each crime; they contained no additional information relevant to this project and were discarded.

Lastly, step five was storing the final dataset in MySQL.
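A sketch of steps four and five under stated assumptions: the column name “Crime Code”, min-max scaling as the normalization method, the list of kept columns, and the MySQL credentials are all illustrative:

```python
from sqlalchemy import create_engine

# Step 4: invert the code scale (100 / code) so higher means more serious,
# then min-max normalize into [0, 1].
inverted = 100.0 / crime["Crime Code"]
crime["Crime_Weighted_Norm"] = (inverted - inverted.min()) / (
    inverted.max() - inverted.min()
)

# Drop the police-log columns, keeping only what the project needs
# (the exact set of kept columns here is illustrative).
crime = crime[["Neighborhood", "Crime_Weighted_Norm"]]

# Step 5: store the final dataset in MySQL (placeholder credentials).
engine = create_engine("mysql+pymysql://user:password@localhost/la_project")
crime.to_sql("crime", engine, if_exists="replace", index=False)
```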