Create_UK_Crime_Dataset

Mining Association Rules in Distributed Databases

Matheus Schmitz
LinkedIn
Github Portfolio

Problem Statement

My intent for this project is to use Spark (this time in combination with R) to mine association rules. This of course requires tools suited to Big Data, hence the choice of Spark, which allows for the manipulation of large datasets distributed across multiple computing nodes.

Big data presents certain hindrances to neural networks and other gradient-descent-based learning approaches, while being friendlier to techniques that are more easily parallelized, such as those employed when mining association rules. Such techniques are therefore commonly used to analyse very large datasets, and practising that is my goal here. Among the multiple algorithms available, I'll focus on one that is widely regarded as being among the best: the Frequent Pattern Growth (FP-Growth) algorithm.

For this project I'll be using crime data available from the UK Police's Open Data Portal, which contains a variety of records on all registered crimes. The data is available from 2014 onwards, although I've chosen to work with two years of data, from January 2019 to December 2020, which allows for a control group (2019) and a test group (2020) for exploring the impact of COVID-19 on crime patterns.

Data Source: https://data.police.uk/data/

Dataset Creation

There are 44 or 45 CSV files per month (one per region). Over 24 months the expected number of files is therefore between 1056 and 1080, so it seems like we got them all!
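The sanity check above is simple arithmetic, and a minimal sketch of it could look like the following. It is written in plain Python for illustration and is not the code used in the notebook; the function names `expected_file_range` and `count_is_plausible` are hypothetical.

```python
from typing import Tuple


def expected_file_range(min_regions: int, max_regions: int, months: int) -> Tuple[int, int]:
    """Return the (lowest, highest) number of CSV files we expect to find."""
    return min_regions * months, max_regions * months


def count_is_plausible(n_files: int, min_regions: int = 44,
                       max_regions: int = 45, months: int = 24) -> bool:
    """Check whether an observed file count falls inside the expected range."""
    low, high = expected_file_range(min_regions, max_regions, months)
    return low <= n_files <= high


low, high = expected_file_range(44, 45, 24)
print(low, high)  # 1056 1080
```

In practice `n_files` would come from counting the downloaded CSVs (for example with `glob`), and a count outside the range would suggest a missing or duplicated month.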

https://stackoverflow.com/questions/59552212/choosing-support-and-confidence-values-with-ml-fpgrowth-in-sparklyr

Mining_Association_Rules_Sparklyr

R Environment Setup

Exploratory Analysis

Visualizing Crimes with Leaflet!

View the interactive map here: Dispersion of Crimes in the UK

Interactive Map with Clustering by Region

View the interactive map here: Crime Clustering by Region

Data Cleaning

Focusing on one City

We have very few samples for a single city, let's move back to the full dataset.

Current Investigation Status of Crimes

Tabulate Crime Types

Indeed all Anti-Social Behavior crimes go unresolved.

Feature Engineering

In order to apply Spark's FP-Growth algorithm (to mine association rules) the data must be in the correct format, which means converting it from wide format to long format and collecting the elements of each record into a list.

The 4 elements in each list are [LSOA name, Location, Crime type, Last outcome category].
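To make the target format concrete, here is a minimal sketch in plain Python (not the sparklyr code from the notebook) that turns one wide record into a list of items. The helper name `row_to_transaction`, the `column=value` prefixing, and the sample row are all assumptions made for illustration. Prefixing each value with its column name is one way to keep the items within a transaction unique, which Spark's FP-Growth requires.

```python
from typing import Dict, List, Optional

# The four fields named in the text above.
ITEM_COLUMNS = ["LSOA name", "Location", "Crime type", "Last outcome category"]


def row_to_transaction(row: Dict[str, Optional[str]]) -> List[str]:
    """Collect the non-missing fields of one record into a list of unique items."""
    items: List[str] = []
    for column in ITEM_COLUMNS:
        value = row.get(column)
        if value is None or value == "":
            continue  # skip missing values instead of emitting an empty item
        item = f"{column}={value}"
        if item not in items:  # guard against duplicate items in a transaction
            items.append(item)
    return items


# Made-up record, only to show the shape of the output.
sample_row = {
    "LSOA name": "Example LSOA 001A",
    "Location": "On or near Example Street",
    "Crime type": "Burglary",
    "Last outcome category": None,
}
print(row_to_transaction(sample_row))
```

Each resulting list plays the role of one "basket" in the classic market-basket formulation of association rule mining.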

Mining Association Rules

This step uses the FP-Growth algorithm, which improves on the famous A-Priori algorithm by mining frequent itemsets without explicit candidate generation.
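FP-Growth is controlled by two thresholds, minimum support and minimum confidence. The sketch below shows what those two quantities measure by brute-force counting on a handful of made-up transactions. It is deliberately not FP-Growth (it enumerates every item combination, which only works on toy data) and it is plain Python rather than the notebook's sparklyr code. The function names and the toy transactions are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple


def frequent_itemsets(transactions: List[List[str]],
                      min_support: float) -> Dict[FrozenSet[str], float]:
    """Brute-force support: fraction of transactions containing each itemset."""
    n = len(transactions)
    counts: Dict[FrozenSet[str], int] = {}
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(1, len(items) + 1):
            for combo in combinations(items, size):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    return {k: c / n for k, c in counts.items() if c / n >= min_support}


def association_rules(supports: Dict[FrozenSet[str], float],
                      min_confidence: float) -> List[Tuple[FrozenSet[str], str, float]]:
    """Single-consequent rules: confidence = support(itemset) / support(antecedent)."""
    rules = []
    for itemset, support in supports.items():
        if len(itemset) < 2:
            continue
        for consequent in sorted(itemset):
            antecedent = itemset - {consequent}
            # Every subset of a frequent itemset is itself frequent,
            # so the antecedent is guaranteed to be in `supports`.
            confidence = support / supports[antecedent]
            if confidence >= min_confidence:
                rules.append((antecedent, consequent, confidence))
    return rules


# Made-up transactions, for illustration only.
toy = [
    ["burglary", "under investigation"],
    ["burglary", "under investigation"],
    ["burglary", "no suspect"],
    ["theft", "no suspect"],
]
supports = frequent_itemsets(toy, min_support=0.5)
rules = association_rules(supports, min_confidence=0.8)
print(rules)
```

Raising the minimum support shrinks the set of itemsets considered at all, while raising the minimum confidence only filters the rules derived from them, which is why the two thresholds are usually tuned together.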

Visualize the Association Rules

View the interactive Association Rule Graph here: Association Rules Graph
(This one takes a bit of time to load)

End
