Marijuana_Crime_01-hidden_key

Project Specification

In 2016 California voters aproved Proposition 64, legalizing recreative marijuana usage. More information about the Proposal can be found at:

California Proposition 64, Marijuana Legalization (2016)

Those opposing the Proposal presented five main objections:

  • (1) It would double the number of highway deaths;
  • (2) It would increase the cultivation or marijuana near schools and parks;
  • (3) It would increase the black market and cartel activity;
  • (4) It would harm poor communities with dependency problems due to inceased availability;
  • (5) It would increase criminality over time, specially in the areas marijuana dispensaries.

The goal of this project is to analyze such claims through time series and geospacial analytical tools, to assess whether those predictions became true since legalization. The focus will be on marijuana related detensions and in the general crime rate near dispensaries. Claims 2, 4 and 5 will be assessed.

Data Sources:

  • 1) Data regarding marijuana dispensaries in Los Angeles: Can be obtained through the Yelp API.
  • 2) Data regarding Los Angeles crimes: Can be obtained through the Los Angeles Open Data Portal.
  • 3) Data regarding arrest in Los Angeles: Can be obtained through the same portal.
  • 4) Demographic data from Los Angeles schools: Are pubicly available on Los Angeles School Information Branch.

All datasets have been separatedly downloaded and will be loaded from my GitHub to ensure forward compatibility.

Load Packages

In [1]:
#!pip install -q yelp
In [2]:
#!pip install -q yelpapi
In [3]:
# Disabling the multiple messages generated by the new versions from Pandas and Matplotlib
import sys
import warnings
import matplotlib.cbook
if not sys.warnoptions:
    warnings.simplefilter('ignore')
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=matplotlib.cbook.mplDeprecation)

# Data manipulation imports
import re
import json
import time
import numpy as np
import pandas as pd
from yelp.client import Client
from yelpapi import YelpAPI

# File manipulation imports for Google Colab
#from google.colab import drive
#drive.mount('/content/drive')
#import os
#os.chdir("/content/drive/My Drive/Colab Notebooks/Marijuana_Crime")
In [4]:
!ls
Marijuana_Crime_1.ipynb
files

Dataset 1

Obtaining from Yelp data about Los Angeles dispensaries

In [5]:
# Yelp API key
# Link to get my API key: https://www.yelp.com/developers/v3/manage_app?app_created=True
my_api = ''
In [6]:
# Connect to Yelp
client_access = Client(my_api)
In [7]:
# Function to format the file with the API queries
def format_file(file_path,
                logfile = './files/log_file.txt',
                file_description = None):
  
  # Applying regular expressiosn to clean the file name
  # Adjusting the file extension
    try:
        ext = re.search('(?<!^)(?<!\.)\.(?!\.)', file_path).start()
    except:
        raise NameError('File could not be found in this path.')

    # Adjusting the timestamp for the file name
    try:
        stamp = re.search('(?<!^)(?<!\.)[a-z]+_[a-z]+(?=\.)', file_path).start()
    except:
        raise NameError('File could not be found in this path.')

    # Format the file name adding the timestamp
    formatted_name = f'{file_path[:stamp]}{round(time.time())}_{file_path[stamp:]}' 

    # In case there is no file description, generate one
    if not file_description:
        file_description = f'File generated in: {time.asctime(time.gmtime(round(time.time())))}'

    # Opening the log file and storing the formatted data file and its description
    with open(logfile, 'a+') as f:
        f.write(f'{formatted_name}: {file_description}\n')

    # Returning the formatted data file and its description
    return formatted_name, file_description
In [8]:
# Connecting to the Yelp API and retrieving 1000 samples
# That is, 1000 registers from stores classified as 'dispensaries'
def yelp_search(category,
                location,
                offset_number = 0,
                n_samples = 1000):
  
    # API
    yelp_api = YelpAPI(my_api)

    # Register from the last result
    last_result = round(time.time())

    # List to store results
    results = []

    # Size
    size = 50

    # Initialize the loop count
    loops = 0

    # Initialize the runs count
    run = 1

    # Initialize offset
    offset_count = offset_number

    # Loop to retrieve data:
    while loops < n_samples:
    
        print(f'Initializing query {run}')

        # Query
        posts = yelp_api.search_query(categories = category,
                                  location = location,
                                  offset = offset_count,
                                  limit = size)
    
        # Business related posts
        results.extend(posts['businesses'])

        # Increment the loop count
        loops += size

        # Increment offset
        offset_count += size

        # Wait 3 seconds to run next query
        time.sleep(3)

        # Increment runs
        run += 1

    # Once the loop is finished, obtain the formatted file name and description
    formatted_name, file_description = format_file(file_path = f'./files/file_{category}.json')

    # Opening the formatted file and dumping the query results in json format
    with open(formatted_name, 'w+') as f:
        json.dump(results, f)

    print(f'\nQuery finished. Number of stores found: {len(results)} {category}.')

    global timestamp
    timestamp = round(time.time())

    return print(f'\nThe last timestamp was: {timestamp}.')
In [9]:
# Run query with business category and city
yelp_search('cannabisdispensaries', 'los angeles', n_samples = 1000)
Initializing query 1
Initializing query 2
Initializing query 3
Initializing query 4
Initializing query 5
Initializing query 6
Initializing query 7
Initializing query 8
Initializing query 9
Initializing query 10
Initializing query 11
Initializing query 12
Initializing query 13
Initializing query 14
Initializing query 15
Initializing query 16
Initializing query 17
Initializing query 18
Initializing query 19
Initializing query 20

Query finished. Number of stores found: 297 cannabisdispensaries.

The last timestamp was: 1589727546.
In [10]:
# Open the JSON file, read it, and generat the final list
with open(f'files/{timestamp}_file_cannabisdispensaries.json', 'r') as f:
    la_dispensaries = json.load(f)
In [11]:
# Checking
la_dispensaries[5]
Out[11]:
{'id': 'aAagpdwqDYNfVtlH39Csbg',
 'alias': 'kushfly-los-angeles-5',
 'name': 'Kushfly',
 'image_url': 'https://s3-media1.fl.yelpcdn.com/bphoto/7LaiBUBA_t2w2irxevaunQ/o.jpg',
 'is_closed': False,
 'url': 'https://www.yelp.com/biz/kushfly-los-angeles-5?adjust_creative=1vpy6OEt4U7reaEsVwesKw&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=1vpy6OEt4U7reaEsVwesKw',
 'review_count': 229,
 'categories': [{'alias': 'cannabisdispensaries',
   'title': 'Cannabis Dispensaries'}],
 'rating': 4.0,
 'coordinates': {'latitude': 34.127617, 'longitude': -118.34671},
 'transactions': [],
 'price': '$$',
 'location': {'address1': '',
  'address2': None,
  'address3': None,
  'city': 'Los Angeles',
  'zip_code': '90068',
  'country': 'US',
  'state': 'CA',
  'display_address': ['Los Angeles, CA 90068']},
 'phone': '+18555710420',
 'display_phone': '(855) 571-0420',
 'distance': 7703.481319255708}
In [12]:
# Function to create a dataframe with the dispensaries data
def organize_data(stores_list, df_name = 'df_stores'):
    
    # Convert the list to a dataframe
    df_name = pd.DataFrame(stores_list)

    # List with the desired columns
    col_list = ['name',
                'is_closed',
                'url',
                'rating',
                'price',
                'review_count']

    # Filter the dataframe to have only the desired columns
    df_name = df_name[col_list]

    return df_name
In [13]:
# Apply the function to create the dataframe
df_stores = organize_data(la_dispensaries)
In [14]:
df_stores.shape
Out[14]:
(297, 6)
In [15]:
df_stores.head()
Out[15]:
name is_closed url rating price review_count
0 California Caregivers Alliance False https://www.yelp.com/biz/california-caregivers... 4.5 NaN 248
1 Herbarium False https://www.yelp.com/biz/herbarium-west-hollyw... 4.5 $$ 230
2 MedMen Los Angeles - DTLA False https://www.yelp.com/biz/medmen-los-angeles-dt... 4.0 $$ 303
3 Green Earth Collective False https://www.yelp.com/biz/green-earth-collectiv... 4.5 $$ 183
4 HERB False https://www.yelp.com/biz/herb-los-angeles-3?ad... 4.5 $$ 99
In [16]:
# Extracting latitudes
latitude_list = [la_dispensaries[i]['coordinates']['latitude'] for i in range(len(la_dispensaries))]
In [17]:
# Extracting longitudes
longitude_list = [la_dispensaries[i]['coordinates']['longitude'] for i in range(len(la_dispensaries))]
In [18]:
# Add geolocation to dataframe
df_stores['latitude'] = latitude_list
df_stores['longitude'] = longitude_list
In [19]:
df_stores.head()
Out[19]:
name is_closed url rating price review_count latitude longitude
0 California Caregivers Alliance False https://www.yelp.com/biz/california-caregivers... 4.5 NaN 248 34.082350 -118.272037
1 Herbarium False https://www.yelp.com/biz/herbarium-west-hollyw... 4.5 $$ 230 34.088530 -118.344600
2 MedMen Los Angeles - DTLA False https://www.yelp.com/biz/medmen-los-angeles-dt... 4.0 $$ 303 34.044600 -118.254440
3 Green Earth Collective False https://www.yelp.com/biz/green-earth-collectiv... 4.5 $$ 183 34.122470 -118.210670
4 HERB False https://www.yelp.com/biz/herb-los-angeles-3?ad... 4.5 $$ 99 34.043499 -118.250206
In [20]:
# Create a location column with a tuple containing latitude and longitude
df_stores['location'] = list(zip(df_stores['latitude'], df_stores['longitude']))
In [21]:
df_stores.head()
Out[21]:
name is_closed url rating price review_count latitude longitude location
0 California Caregivers Alliance False https://www.yelp.com/biz/california-caregivers... 4.5 NaN 248 34.082350 -118.272037 (34.08235, -118.272037)
1 Herbarium False https://www.yelp.com/biz/herbarium-west-hollyw... 4.5 $$ 230 34.088530 -118.344600 (34.08853, -118.3446)
2 MedMen Los Angeles - DTLA False https://www.yelp.com/biz/medmen-los-angeles-dt... 4.0 $$ 303 34.044600 -118.254440 (34.0446, -118.25444)
3 Green Earth Collective False https://www.yelp.com/biz/green-earth-collectiv... 4.5 $$ 183 34.122470 -118.210670 (34.12247, -118.21067)
4 HERB False https://www.yelp.com/biz/herb-los-angeles-3?ad... 4.5 $$ 99 34.043499 -118.250206 (34.0434989929199, -118.250205993652)
In [22]:
df_stores.shape
Out[22]:
(297, 9)
In [23]:
# Saving the dataset
df_stores.to_csv('files/df_stores.csv')

Dataset 2

Obtaining data from Los Angeles crimes between 2010 and 2019. Dataset downloaded from LA Open Data Portal. Step-by-step to download it is:

  • 1) Access: https://data.lacity.org/
  • 2) Search: "Crime Data"
  • 3) Choose: "Crime Data from 2010 to 2019"
  • 4) Select: "View Data" -> "Export" -> "Download" -> "CSV"

Data dictonary: https://data.lacity.org/A-Safe-City/Crime-Data-from-2010-to-2019/63jg-8b9z

I've uploaded the CSV to the ./files folder in this project's directory.

In [24]:
# Loading the CSV with crime data
df_crimes = pd.read_csv('files/Crime_Data_from_2010_to_2019.csv')
In [25]:
df_crimes.shape
Out[25]:
(2114238, 28)
In [26]:
df_crimes.head(3)
Out[26]:
DR_NO Date Rptd DATE OCC TIME OCC AREA AREA NAME Rpt Dist No Part 1-2 Crm Cd Crm Cd Desc ... Status Status Desc Crm Cd 1 Crm Cd 2 Crm Cd 3 Crm Cd 4 LOCATION Cross Street LAT LON
0 1307355 02/20/2010 12:00:00 AM 02/20/2010 12:00:00 AM 1350 13 Newton 1385 2 900 VIOLATION OF COURT ORDER ... AA Adult Arrest 900.0 NaN NaN NaN 300 E GAGE AV NaN 33.9825 -118.2695
1 11401303 09/13/2010 12:00:00 AM 09/12/2010 12:00:00 AM 45 14 Pacific 1485 2 740 VANDALISM - FELONY ($400 & OVER, ALL CHURCH VA... ... IC Invest Cont 740.0 NaN NaN NaN SEPULVEDA BL MANCHESTER AV 33.9599 -118.3962
2 70309629 08/09/2010 12:00:00 AM 08/09/2010 12:00:00 AM 1515 13 Newton 1324 2 946 OTHER MISCELLANEOUS CRIME ... IC Invest Cont 946.0 NaN NaN NaN 1300 E 21ST ST NaN 34.0224 -118.2524

3 rows × 28 columns

In [27]:
df_crimes.isnull().sum()
Out[27]:
DR_NO                   0
Date Rptd               0
DATE OCC                0
TIME OCC                0
AREA                    0
AREA NAME               0
Rpt Dist No             0
Part 1-2                0
Crm Cd                  0
Crm Cd Desc             0
Mocodes            227959
Vict Age                0
Vict Sex           196652
Vict Descent       196699
Premis Cd              53
Premis Desc           187
Weapon Used Cd    1404139
Weapon Desc       1404140
Status                  3
Status Desc             0
Crm Cd 1               10
Crm Cd 2          1975051
Crm Cd 3          2110747
Crm Cd 4          2114134
LOCATION                0
Cross Street      1758896
LAT                     0
LON                     0
dtype: int64
In [28]:
# Substituting the spaces in the columns' names for underscores
# To simplify filtering/indexation
df_crimes.columns = [column.lower().replace(' ', '_') for column in df_crimes.columns]
In [29]:
# Dropping unnecessary columns
df_crimes.drop(labels = ['crm_cd_1', 
                         'crm_cd_2', 
                         'crm_cd_3', 
                         'crm_cd_4',
                         'premis_cd',
                         'premis_desc', 
                         'vict_descent', 
                         'vict_sex',
                         'status',
                         'dr_no', 
                         'area_',
                         'date_rptd',
                         'rpt_dist_no',
                         'crm_cd',
                         'part_1-2',
                         'mocodes',
                         'cross_street',
                         'weapon_used_cd',
                         'status_desc',
                         'time_occ',
                         'vict_age'],
               axis = 1,
               inplace = True)
In [30]:
# Keeping the weapon_desc column and filling NAs with 'unknown'
df_crimes.weapon_desc.fillna('unknown', inplace = True)
In [31]:
df_crimes.isnull().sum()
Out[31]:
date_occ       0
area_name      0
crm_cd_desc    0
weapon_desc    0
location       0
lat            0
lon            0
dtype: int64
In [32]:
df_crimes.head(3)
Out[32]:
date_occ area_name crm_cd_desc weapon_desc location lat lon
0 02/20/2010 12:00:00 AM Newton VIOLATION OF COURT ORDER unknown 300 E GAGE AV 33.9825 -118.2695
1 09/12/2010 12:00:00 AM Pacific VANDALISM - FELONY ($400 & OVER, ALL CHURCH VA... unknown SEPULVEDA BL 33.9599 -118.3962
2 08/09/2010 12:00:00 AM Newton OTHER MISCELLANEOUS CRIME unknown 1300 E 21ST ST 34.0224 -118.2524
In [33]:
# Shape
df_crimes.shape
Out[33]:
(2114238, 7)
In [34]:
# Saving the dataframe to a CSV file
df_crimes.to_csv('files/df_crimes.csv')

Dataset 3

Obtaining data from Los Angeles arrests from 2010 onwards. Dataset downloaded from LA Open Data Portal. Step-by-step to download it is:

1) Access: https://data.lacity.org/ 2) Search: "Arrests" 3) Choose: "Arrest Data from 2010 to Present" 4) Select: "View Data" -> "Export" -> "Download" -> "CSV" Data dictonary: https://data.lacity.org/A-Safe-City/Arrest-Data-from-2010-to-Present/yru6-6re4

I've uploaded the CSV to the ./files folder in this project's directory.

In [35]:
# Loading the Arrests CSV
df_arrests = pd.read_csv('files/Arrest_Data_from_2010_to_Present.csv')
In [36]:
df_arrests.shape
Out[36]:
(1350103, 17)
In [37]:
df_arrests.head(3)
Out[37]:
Report ID Arrest Date Time Area ID Area Name Reporting District Age Sex Code Descent Code Charge Group Code Charge Group Description Arrest Type Code Charge Charge Description Address Cross Street Location
0 200110044 03/27/2020 2125.0 1 Central 142 20 F W 13.0 Prostitution/Allied M 647(B)PC PROSTITUTION 400 S FIGUEROA ST NaN (34.0535, -118.256)
1 200110045 04/01/2020 1800.0 1 Central 166 44 F H 18.0 Drunkeness M 41.27(C)LAM DRINKING IN PUBLIC*** WINSTON ST SAN PEDRO ST (34.0421, -118.2469)
2 200110271 03/08/2020 1545.0 1 Central 166 61 M B 18.0 Drunkeness M 41.27(C)LAM DRINKING IN PUBLIC*** 6TH SAN JULIAN (34.0428, -118.2461)
In [38]:
df_arrests.isnull().sum()
Out[38]:
Report ID                        0
Arrest Date                      0
Time                           198
Area ID                          0
Area Name                        0
Reporting District               0
Age                              0
Sex Code                         0
Descent Code                     0
Charge Group Code            91986
Charge Group Description     92528
Arrest Type Code                 1
Charge                           0
Charge Description           91775
Address                          0
Cross Street                583097
Location                         0
dtype: int64
In [39]:
df_arrests.dtypes
Out[39]:
Report ID                     int64
Arrest Date                  object
Time                        float64
Area ID                       int64
Area Name                    object
Reporting District            int64
Age                           int64
Sex Code                     object
Descent Code                 object
Charge Group Code           float64
Charge Group Description     object
Arrest Type Code             object
Charge                       object
Charge Description           object
Address                      object
Cross Street                 object
Location                     object
dtype: object
In [40]:
# Converting the 'Arrest Date' column from String to Datetime, to facilitate manipulation
df_arrests['Arrest Date'] = pd.to_datetime(df_arrests['Arrest Date'])
In [41]:
# Using regular expressions to clean the 'Location' column and convert it to a list
df_arrests['Location'] = df_arrests['Location'].map(lambda x: re.sub('[(),°]', '', x)).str.split()
In [42]:
df_arrests.head(3)
Out[42]:
Report ID Arrest Date Time Area ID Area Name Reporting District Age Sex Code Descent Code Charge Group Code Charge Group Description Arrest Type Code Charge Charge Description Address Cross Street Location
0 200110044 2020-03-27 2125.0 1 Central 142 20 F W 13.0 Prostitution/Allied M 647(B)PC PROSTITUTION 400 S FIGUEROA ST NaN [34.0535, -118.256]
1 200110045 2020-04-01 1800.0 1 Central 166 44 F H 18.0 Drunkeness M 41.27(C)LAM DRINKING IN PUBLIC*** WINSTON ST SAN PEDRO ST [34.0421, -118.2469]
2 200110271 2020-03-08 1545.0 1 Central 166 61 M B 18.0 Drunkeness M 41.27(C)LAM DRINKING IN PUBLIC*** 6TH SAN JULIAN [34.0428, -118.2461]
In [43]:
# Extracting latitude and longitude
df_arrests['latitude'] = df_arrests['Location'].map(lambda x: x[0])
df_arrests['longitude'] = df_arrests['Location'].map(lambda x: x[1])
In [44]:
# Converting the geolocation to float
df_arrests['latitude'] = df_arrests['latitude'].map(lambda x: float(x))
df_arrests['longitude'] = df_arrests['longitude'].map(lambda x: float(x))
In [ ]:
# Converting the 'Charge Description' column to string and lower case
df_arrests['Charge Description'] = df_arrests['Charge Description'].map(lambda x: str(x))
df_arrests['Charge Description'] = df_arrests['Charge Description'].map(lambda x: x.lower())
In [ ]:
df_arrests.shape
In [ ]:
df_arrests.head(3)
In [ ]:
# Listing all arrest classifications
arrests_list = list(df_arrests['Charge Description'].value_counts().index.sort_values())
In [ ]:
# Setting to lower case to standardize
arrests_list = [x.lower() for x in arrests_list]
In [ ]:
# Checking all arrest classifications
arrests_list
In [ ]:
# Filtering all marijuana arrests
# Some descriptions were abbreviated to 'marij'
marijuana_arrests = [x for x in arrests_list if 'marij' in x]
In [ ]:
len(marijuana_arrests)
In [ ]:
marijuana_arrests
In [ ]:
# Creating a new df column to identify arrests which were marijuana related
df_arrests['marijuana_related'] = df_arrests['Charge Description'].map(lambda x: x if x in marijuana_arrests else np.NaN)
In [ ]:
df_arrests.head(3)
In [ ]:
# Counting non-NA values to see how many marijuana arrests there were
len(df_arrests[~df_arrests['marijuana_related'].isnull()])
In [ ]:
# Keeping only the marijuana related data
df_arrests = df_arrests[~df_arrests['marijuana_related'].isnull()]
In [ ]:
df_arrests.shape
In [ ]:
df_arrests.head(3)
In [ ]:
# Saving dataframe as csv
df_arrests.to_csv('files/df_arrests.csv')

Dataset 4

Obtaining geolocation data on Los Angeles schools.

Data sourced from: http://www.lausd.k12.ca.us/lausd/offices/bulletins/

File used: http://www.lausd.k12.ca.us/lausd/offices/bulletins/lausdk12.tab

In [ ]:
# Loading file
df_schools = pd.read_csv('files/lausdk12.tab', sep = '\t')
In [ ]:
df_schools.shape
In [ ]:
df_schools.head(3)
In [ ]:
# Merging Address + City + State + ZIP to create a new 'complete_address' column
df_schools['complete_address'] = df_schools['Address'] + ' ' + df_schools['City'] + ' ' + df_schools['State'] + ' ' + df_schools['Zip Code'].astype(str)
In [ ]:
df_schools['complete_address'] = df_schools['complete_address'].astype(str)
In [ ]:
# Deleting unnecessary columns
df_schools = df_schools.drop(['Address',
                             'City',
                             'State',
                             'Cost Center Code',
                             'Legacy Code',
                             'Telephone',
                             'Fax',
                             'Calendar',
                             'File Build Date'],
                            1)
In [ ]:
# Checking for duplicates
# This can happen if there are two schools registered on the same address, such as a kindergarten and middle school
df_schools = df_schools[~df_schools.duplicated(subset = 'complete_address')].sort_values('complete_address')
In [ ]:
# Resetting index
df_schools.reset_index(drop = True, inplace = True)
In [ ]:
df_schools.shape
In [ ]:
df_schools.head(3)
In [ ]:
# Saving the schools df
df_schools.to_csv('files/df_schools.csv')
Marijuana_Crime_02-hidden_key

Marijuana_Crime_02-hidden_key

Tasks for part 2:

  • Exploratory Data Analysis
  • Construction of Geolocation graphics, showing the exact location of each occurange
  • Segmenting the data in yearly periods
  • Calculating the distance between events based on Geolocation (use the geopy package)
  • Statistical summary of the variables
  • Analysis and interpretation of the variable's distributions
  • Normality test to verify whether the samples were extracted from a normally distributed population
  • Analysis and interpretation of the p-value
  • Conclusions from the exploratory analysis
In [1]:
# Disabling the multiple messages generated by the new versions from Pandas and Matplotlib
import sys
import warnings
import matplotlib.cbook
if not sys.warnoptions:
    warnings.simplefilter('ignore')
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=matplotlib.cbook.mplDeprecation)

# Data manipulation and visualization
import scipy
import numpy as np
import pandas as pd
import matplotlib as m
import matplotlib.pyplot as plt
from scipy import stats

# Graphics formatting
m.rcParams['axes.labelsize'] = 14
m.rcParams['xtick.labelsize'] = 12
m.rcParams['ytick.labelsize'] = 12
from matplotlib.pylab import rcParams 
rcParams['figure.figsize'] = 15,7
plt.style.use('fivethirtyeight')
%matplotlib inline

Exploratory Analysis

In [2]:
df_arrests = pd.read_csv('files/df_arrests.csv')
In [3]:
df_arrests.shape
Out[3]:
(20673, 21)
In [4]:
df_arrests.head()
Out[4]:
Unnamed: 0 Report ID Arrest Date Time Area ID Area Name Reporting District Age Sex Code Descent Code ... Charge Group Description Arrest Type Code Charge Charge Description Address Cross Street Location latitude longitude marijuana_related
0 340 5568617 2019-03-09 2015.0 6 Hollywood 646 29 M O ... Narcotic Drug Laws F 11359HS possession marijuana for sale CHEROKEE HOLLYWOOD ['34.1016', '-118.335'] 34.1016 -118.3350 possession marijuana for sale
1 342 5568629 2019-03-09 2015.0 6 Hollywood 646 25 M B ... Narcotic Drug Laws F 11359HS possession marijuana for sale CHEROKEE HOLLYWOOD ['34.1016', '-118.335'] 34.1016 -118.3350 possession marijuana for sale
2 983 5607178 2019-04-21 2115.0 7 Wilshire 702 42 M B ... Narcotic Drug Laws F 11359(D)HS poss/sale marij ovr 21 employ per 20/belw MELROSE AV FAIRFAX AV ['34.0838', '-118.3614'] 34.0838 -118.3614 poss/sale marij ovr 21 employ per 20/belw
3 1249 200608837 2020-03-17 2200.0 6 Hollywood 646 21 F B ... Narcotic Drug Laws I 11362.3A1HS smoke/ingest marijuana in public place HOLLYWOOD VINE ['34.1016', '-118.3267'] 34.1016 -118.3267 smoke/ingest marijuana in public place
4 1255 200608903 2020-03-19 2320.0 6 Hollywood 645 25 M W ... Narcotic Drug Laws I 11362.3A1HS smoke/ingest marijuana in public place HOLLYWOOD HIGHLAND ['34.1016', '-118.3387'] 34.1016 -118.3387 smoke/ingest marijuana in public place

5 rows × 21 columns

In [5]:
df_arrests.describe()
Out[5]:
Unnamed: 0 Report ID Time Area ID Reporting District Age Charge Group Code latitude longitude
count 2.067300e+04 2.067300e+04 20670.000000 20673.000000 20673.000000 20673.000000 20673.0 20673.000000 20673.000000
mean 5.421069e+05 5.327595e+07 1523.879971 10.980941 1146.309196 29.037489 16.0 34.053144 -118.304159
std 3.585493e+05 5.941124e+07 530.155180 6.080943 606.506447 11.246525 0.0 0.720001 2.471156
min 3.400000e+02 2.586000e+03 1.000000 1.000000 100.000000 11.000000 16.0 0.000000 -118.664700
25% 1.968940e+05 3.119415e+06 1245.000000 6.000000 645.000000 20.000000 16.0 33.993300 -118.434500
50% 5.307320e+05 4.815994e+06 1615.000000 12.000000 1239.000000 26.000000 16.0 34.048200 -118.330600
75% 8.068310e+05 1.103068e+08 1905.000000 16.000000 1656.000000 35.000000 16.0 34.167600 -118.276200
max 1.349967e+06 2.019007e+08 2358.000000 21.000000 2197.000000 79.000000 16.0 34.547700 0.000000

Checking the range of arrest dates:

In [6]:
# Minimum value
df_arrests['Arrest Date'].min()
Out[6]:
'2010-01-01'
In [7]:
# Maxium value
df_arrests['Arrest Date'].max()
Out[7]:
'2020-05-08'

Considering that the law came into force in 2017, and that therefore there are three years for which there is complete data of the after effects, I'll be restricting the analysis to 3 years before and 3 years after. In other words, the follows years will be considered: 2014, 2015, 2016, 2017, 2018, 2019.

In [8]:
# Filtering the dataframe to the 2014-2019 range
df_arrests = df_arrests[(df_arrests['Arrest Date'] >= '2014-01-01') & (df_arrests['Arrest Date'] <= '2019-12-31')]
In [9]:
df_arrests.shape
Out[9]:
(6226, 21)
In [10]:
# Saving the updated dataset
df_arrests.to_csv('files/df_arrests_02.csv', index = False)

Geolocation Plots

Plot of Arrests

In [11]:
# Filtering the dataframe per year
df_arrests_2014 = df_arrests[(df_arrests['Arrest Date'] >= '2014-01-01') & (df_arrests['Arrest Date'] <= '2014-12-31')]
df_arrests_2015 = df_arrests[(df_arrests['Arrest Date'] >= '2015-01-01') & (df_arrests['Arrest Date'] <= '2015-12-31')]
df_arrests_2016 = df_arrests[(df_arrests['Arrest Date'] >= '2016-01-01') & (df_arrests['Arrest Date'] <= '2016-12-31')]
df_arrests_2017 = df_arrests[(df_arrests['Arrest Date'] >= '2017-01-01') & (df_arrests['Arrest Date'] <= '2017-12-31')]
df_arrests_2018 = df_arrests[(df_arrests['Arrest Date'] >= '2018-01-01') & (df_arrests['Arrest Date'] <= '2018-12-31')]
df_arrests_2019 = df_arrests[(df_arrests['Arrest Date'] >= '2019-01-01') & (df_arrests['Arrest Date'] <= '2019-12-31')]
In [270]:
# Function to create the arrest geolocation plots
def geo_plot_arrests(dataframe, title):
    
    # Total
    print(f'Tally of Marijuana Related Arrests: {len(dataframe)}')
    
    # Plot area
    fig, ax = plt.subplots(figsize = (20,10))
    
    # Scatterplot
    plt.scatter(dataframe['longitude'], dataframe['latitude'],c = 'red', s = 12, alpha = 0.4, label = 'Arrests')

    # Labels and Legend
    plt.xlabel('Longitude', fontsize = 20)
    plt.ylabel('Latitude', fontsize = 20)
    plt.xticks(fontsize = 15)
    plt.yticks(fontsize = 15)
    plt.legend(fontsize = 18)
    plt.title(title, fontsize = 26)
    plt.tight_layout()
In [271]:
# Total arrests in 2014
geo_plot_arrests(df_arrests_2014, 'Total Arrests in 2014')
Tally of Marijuana Related Arrests: 1705
In [14]:
# Total arrests in 2015
geo_plot_arrests(df_arrests_2015, 'Total Arrests in 2015')
Tally of Marijuana Related Arrests: 1535
In [15]:
# Total arrests in 2016
geo_plot_arrests(df_arrests_2016, 'Total Arrests in 2016')
Tally of Marijuana Related Arrests: 1386
In [16]:
# Total arrests in 2017
geo_plot_arrests(df_arrests_2017, 'Total Arrests in 2017')
Tally of Marijuana Related Arrests: 528
In [17]:
# Total arrests in 2018
geo_plot_arrests(df_arrests_2018, 'Total Arrests in 2018')
Tally of Marijuana Related Arrests: 564
In [18]:
# Total arrests in 2019
geo_plot_arrests(df_arrests_2019, 'Total Arrests in 2019')
Tally of Marijuana Related Arrests: 508

The data and the plots show that there was a drop in arrests from 2014 to 2019, with a marked fall starting in 2017, the year the new marijuana laws came into force. Although correlation doesn't imply causation, this is nevertheless evidence that the law impacted marijuana arrests.

Plot of Arrests x Dispensaries

In [19]:
# Loading the dispensaries dataset
df_stores = pd.read_csv('files/df_stores.csv')
In [20]:
df_stores.shape
Out[20]:
(297, 10)
In [21]:
df_stores.head()
Out[21]:
Unnamed: 0 name is_closed url rating price review_count latitude longitude location
0 0 California Caregivers Alliance False https://www.yelp.com/biz/california-caregivers... 4.5 NaN 248 34.082350 -118.272037 (34.08235, -118.272037)
1 1 Herbarium False https://www.yelp.com/biz/herbarium-west-hollyw... 4.5 $$ 230 34.088530 -118.344600 (34.08853, -118.3446)
2 2 MedMen Los Angeles - DTLA False https://www.yelp.com/biz/medmen-los-angeles-dt... 4.0 $$ 303 34.044600 -118.254440 (34.0446, -118.25444)
3 3 Green Earth Collective False https://www.yelp.com/biz/green-earth-collectiv... 4.5 $$ 183 34.122470 -118.210670 (34.12247, -118.21067)
4 4 HERB False https://www.yelp.com/biz/herb-los-angeles-3?ad... 4.5 $$ 99 34.043499 -118.250206 (34.0434989929199, -118.250205993652)

Closed dispensaries are not meaningful to the analysis and can skew results. Removing them:

In [22]:
# Checking for closed stores
df_stores['is_closed'].value_counts()
Out[22]:
False    296
True       1
Name: is_closed, dtype: int64
In [23]:
# Checking the closed store's register
df_stores[df_stores['is_closed']]
Out[23]:
Unnamed: 0 name is_closed url rating price review_count latitude longitude location
234 234 CannaBuzz Store True https://www.yelp.com/biz/cannabuzz-store-los-a... 1.0 NaN 1 34.04124 -118.23775 (34.04124, -118.23775)
In [24]:
# Removing the register
df_stores.drop(df_stores.index[234], inplace = True)
In [25]:
# Checking for closed stores
df_stores['is_closed'].value_counts()
Out[25]:
False    296
Name: is_closed, dtype: int64

For the rest of my analysis, only the name, latitude and longitude variables will be necessary. Dropping the other variables:

In [26]:
# Filtering the dataframe
df_stores = df_stores[['name', 'latitude', 'longitude']]
In [27]:
df_stores.head()
Out[27]:
name latitude longitude
0 California Caregivers Alliance 34.082350 -118.272037
1 Herbarium 34.088530 -118.344600
2 MedMen Los Angeles - DTLA 34.044600 -118.254440
3 Green Earth Collective 34.122470 -118.210670
4 HERB 34.043499 -118.250206
In [28]:
# Statistical summary
df_stores.describe()
Out[28]:
latitude longitude
count 296.000000 296.000000
mean 34.055510 -118.313572
std 0.091237 0.134611
min 33.741532 -118.624410
25% 34.017272 -118.383839
50% 34.046515 -118.323353
75% 34.094220 -118.250082
max 34.310078 -117.455840
In [29]:
# Saving the filtered dataframe
df_stores.to_csv('files/df_stores_02.csv', index = False)
In [274]:
# Function to plot arrest location x store location
def geo_plot_02(dataframe_arrests, year):

    # Plot area
    fig, ax = plt.subplots(figsize = (25,15))
    
    # Scatterplot of Arrests
    plt.scatter(dataframe_arrests['longitude'], dataframe_arrests['latitude'], c = 'red', s = 12, label = 'Arrests')

    # Scatterplot of Dispensaries
    ax.scatter(df_stores['longitude'], df_stores['latitude'], c = 'b', s = 12, label = 'Dispensaries')

    # Labels and Legend
    plt.xlabel('Longitude', fontsize = 20)
    plt.ylabel('Latitude', fontsize = 20)
    plt.xticks(fontsize = 15)
    plt.yticks(fontsize = 15)
    plt.legend(fontsize = 18)
    plt.xlim(None, -118)
    plt.title(f'{year}: Arrests x Dispensaries', fontsize = 30)
    plt.tight_layout()
In [275]:
geo_plot_02(df_arrests_2014, 2014)
In [276]:
geo_plot_02(df_arrests_2015, 2015)
In [277]:
geo_plot_02(df_arrests_2016, 2016)
In [278]:
geo_plot_02(df_arrests_2017, 2017)
In [279]:
geo_plot_02(df_arrests_2018, 2018)
In [280]:
geo_plot_02(df_arrests_2019, 2019)

The arrests were clearly concentrated in areas close to dispensaries, both before and after legalization. Moreso, few if any stores changed places, and over the years more stores were opened. Therefore, considering that legalization didn't change the arrest locatations even though new stores were opened, it seems likely that arrest locations are not related to dispensary locations, but rather to other significant attributes of the city landscape (such as certain neighborhoods or popular areas).

Plot of Arrests x Schools

In [37]:
df_schools = pd.read_csv('files/df_schools.csv')
In [38]:
df_schools.shape
Out[38]:
(947, 7)
In [39]:
df_schools.head()
Out[39]:
Unnamed: 0 School Zip Code Grades Local District Board District complete_address
0 0 Frank Del Olmo Elementary 90004 K- 5 C 2-Monica Garcia 100 N New Hampshire Ave Los Angeles CA 90004
1 1 10th Street Elementary 90015 1- 5 C 2-Monica Garcia 1000 Grattan St Los Angeles CA 90015
2 2 Elementary Community Day School 91311 K- 6 XS 3-Scott M Schmerelson 10001 Jumilla Ave Chatsworth CA 91311
3 3 Cesar E Chavez Learning Academy - Arts/Theatre... 91340 9-12 NE 6-Kelly Gonez 1001 Arroyo Ave San Fernando CA 91340
4 4 San Pedro Senior High Gifted STEAM Magnet 90731 9-12 S 7-Dr Richard A Vladovic 1001 W 15th St San Pedro CA 90731

Geolocation data was not included in the school dataset, hence it will need to be obtained by other means. I'll use the school address paired with the google maps API to obtain the latitude and longitude of each school. Such data will also allow for the identification of duplicate entries (under different names) in the school list.

In [40]:
!pip install -q googlemaps
In [41]:
import googlemaps
In [42]:
# API Key
gmaps_key = googlemaps.Client(key = '')
In [43]:
# Lists to retrieve latitude and longitude
list_latitude = []
list_longitude = []
In [44]:
# Loop through the school addresses list
# This may take a while
for address in df_schools['complete_address']:
    
    # Obtain the address' geocode
    g = gmaps_key.geocode(address)
    
    # Extract latitude and longitude
    try:
        lat = g[0]['geometry']['location']['lat']
        lng = g[0]['geometry']['location']['lng']
    # In case of error fill with NaN
    except:
        lat = np.NaN
        lng = np.NaN
        
    # Apprend latitude and longide to the lists
    list_latitude.append(lat)
    list_longitude.append(lng)
In [45]:
# Adding the lists as columns on the schools dataframe
df_schools['latitude'] = list_latitude
df_schools['longitude'] = list_longitude
In [46]:
# Checking for missing values
df_schools.isnull().sum()
Out[46]:
Unnamed: 0          0
School              0
Zip Code            0
Grades              0
Local District      0
Board District      0
complete_address    0
latitude            8
longitude           8
dtype: int64
In [ ]:
# Using dropna() to avoid problems in case the API fails to fetch the geolocation of a school
df_schools = df_schools.dropna()
In [47]:
df_schools.shape
Out[47]:
(947, 9)
In [48]:
# Creating a single column with all geolocation data
# This column can be used to filter duplicate entries
df_schools['coordinates'] = df_schools['latitude'].astype('str') + ', ' + df_schools['longitude'].astype(str)
In [49]:
df_schools.head()
Out[49]:
Unnamed: 0 School Zip Code Grades Local District Board District complete_address latitude longitude coordinates
0 0 Frank Del Olmo Elementary 90004 K- 5 C 2-Monica Garcia 100 N New Hampshire Ave Los Angeles CA 90004 34.072925 -118.292591 34.072925, -118.292591
1 1 10th Street Elementary 90015 1- 5 C 2-Monica Garcia 1000 Grattan St Los Angeles CA 90015 34.048284 -118.273264 34.0482844, -118.273264
2 2 Elementary Community Day School 91311 K- 6 XS 3-Scott M Schmerelson 10001 Jumilla Ave Chatsworth CA 91311 34.251825 -118.563377 34.2518249, -118.5633769
3 3 Cesar E Chavez Learning Academy - Arts/Theatre... 91340 9-12 NE 6-Kelly Gonez 1001 Arroyo Ave San Fernando CA 91340 34.288514 -118.421011 34.2885136, -118.421011
4 4 San Pedro Senior High Gifted STEAM Magnet 90731 9-12 S 7-Dr Richard A Vladovic 1001 W 15th St San Pedro CA 90731 33.730370 -118.299342 33.73037, -118.2993419
In [131]:
# Using the coordinates to remove duplicate entries
df_schools = df_schools[~df_schools.duplicated(subset = 'coordinates')]
In [132]:
# Reseting index as rows were removed
df_schools.reset_index(drop = True, inplace = True)
In [133]:
# Shape
df_schools.shape
Out[133]:
(906, 9)

The shape shrank from 939 to 906, therefore there were indeed duplicate entries. Most likely schools which operated on the same location two different programs as two different institutions.

In [134]:
# Statistical summary
df_schools.describe()
Out[134]:
Zip Code latitude longitude
count 906.000000 906.000000 906.000000
mean 90475.435982 34.060198 -118.334404
std 587.265043 0.128336 0.116058
min 90001.000000 33.711431 -118.654393
25% 90026.000000 33.981266 -118.422510
50% 90063.000000 34.044304 -118.298979
75% 91306.750000 34.168790 -118.251660
max 91754.000000 34.319027 -118.152239
In [135]:
# The Unnamed: 0 column was the old index which became a column when the csv was imported
# It's not necessary. Removing it
df_schools.drop(labels = 'Unnamed: 0', axis = 1, inplace = True)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-135-c7d6352dea32> in <module>
      1 # The Unnamed: 0 column was the old index which became a column when the csv was imported
      2 # It's not necessary. Removing it
----> 3 df_schools.drop(labels = 'Unnamed: 0', axis = 1, inplace = True)

~\Anaconda3\lib\site-packages\pandas\core\frame.py in drop(self, labels, axis, index, columns, level, inplace, errors)
   3995             level=level,
   3996             inplace=inplace,
-> 3997             errors=errors,
   3998         )
   3999 

~\Anaconda3\lib\site-packages\pandas\core\generic.py in drop(self, labels, axis, index, columns, level, inplace, errors)
   3934         for axis, labels in axes.items():
   3935             if labels is not None:
-> 3936                 obj = obj._drop_axis(labels, axis, level=level, errors=errors)
   3937 
   3938         if inplace:

~\Anaconda3\lib\site-packages\pandas\core\generic.py in _drop_axis(self, labels, axis, level, errors)
   3968                 new_axis = axis.drop(labels, level=level, errors=errors)
   3969             else:
-> 3970                 new_axis = axis.drop(labels, errors=errors)
   3971             result = self.reindex(**{axis_name: new_axis})
   3972 

~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in drop(self, labels, errors)
   5015         if mask.any():
   5016             if errors != "ignore":
-> 5017                 raise KeyError(f"{labels[mask]} not found in axis")
   5018             indexer = indexer[~mask]
   5019         return self.delete(indexer)

KeyError: "['Unnamed: 0'] not found in axis"
In [136]:
df_schools.head()
Out[136]:
School Zip Code Grades Local District Board District complete_address latitude longitude coordinates
0 Frank Del Olmo Elementary 90004 K- 5 C 2-Monica Garcia 100 N New Hampshire Ave Los Angeles CA 90004 34.072925 -118.292591 34.072925, -118.292591
1 10th Street Elementary 90015 1- 5 C 2-Monica Garcia 1000 Grattan St Los Angeles CA 90015 34.048284 -118.273264 34.0482844, -118.273264
2 Elementary Community Day School 91311 K- 6 XS 3-Scott M Schmerelson 10001 Jumilla Ave Chatsworth CA 91311 34.251825 -118.563377 34.2518249, -118.5633769
3 Cesar E Chavez Learning Academy - Arts/Theatre... 91340 9-12 NE 6-Kelly Gonez 1001 Arroyo Ave San Fernando CA 91340 34.288514 -118.421011 34.2885136, -118.421011
4 San Pedro Senior High Gifted STEAM Magnet 90731 9-12 S 7-Dr Richard A Vladovic 1001 W 15th St San Pedro CA 90731 33.730370 -118.299342 33.73037, -118.2993419
In [137]:
# Saving the dataframe
df_schools.to_csv('files/df_schools_02.csv', index = False)
In [138]:
# Function to plot arrest location x school location
def geo_plot_03(dataframe_arrests, year):

    # Plot area
    fig, ax = plt.subplots(figsize = (25,15))
    
    # Scatterplot of Arrests
    plt.scatter(dataframe_arrests['longitude'], dataframe_arrests['latitude'], c = 'red', s = 12, label = 'Arrests')

    # Scatterplot of Dispensaries
    ax.scatter(df_schools['longitude'], df_schools['latitude'], c = 'g', s = 12, label = 'Schools')

    # Labels and Legend
    plt.xlabel('Longitude', fontsize = 20)
    plt.ylabel('Latitude', fontsize = 20)
    plt.xticks(fontsize = 15)
    plt.yticks(fontsize = 15)
    plt.legend(fontsize = 18)
    plt.title(f'{year}: Arrests x Schools', fontsize = 30)
    plt.tight_layout()
In [139]:
geo_plot_03(df_arrests_2014, 2014)
In [140]:
geo_plot_03(df_arrests_2015, 2015)
In [141]:
geo_plot_03(df_arrests_2016, 2016)
In [142]:
geo_plot_03(df_arrests_2017, 2017)
In [143]:
geo_plot_03(df_arrests_2018, 2018)
In [144]:
geo_plot_03(df_arrests_2019, 2019)

There are schools everywhere, so it's hard to assess visually whether there is a higher incidence of marijuana arrests near schools.

As instructed, my next step will be using the geopy package to calculate the distance between schools and arrests.

Distance from Arrests to Dispensaries and Schools

Distance Between Arrest Locations and Dispensaries

Here I'm looking at how many stores where in the area around each arrest

In [145]:
# Creating a dataframe to calculate distances - stores
df_stores_dist = df_stores[['name', 'latitude', 'longitude']]
In [146]:
df_stores_dist.shape
Out[146]:
(296, 3)
In [147]:
df_stores_dist.head()
Out[147]:
name latitude longitude
0 California Caregivers Alliance 34.082350 -118.272037
1 Herbarium 34.088530 -118.344600
2 MedMen Los Angeles - DTLA 34.044600 -118.254440
3 Green Earth Collective 34.122470 -118.210670
4 HERB 34.043499 -118.250206
In [148]:
# Creating a dataframe to calculate distances - arrests
df_arrests_dist = df_arrests[['Report ID', 'latitude', 'longitude']]
In [149]:
# The index is skipping rows, reseting it
df_arrests_dist.reset_index(drop = True, inplace = True)
In [150]:
df_arrests_dist.shape
Out[150]:
(6226, 3)
In [151]:
df_arrests_dist.head(3)
Out[151]:
Report ID latitude longitude
0 5568617 34.1016 -118.3350
1 5568629 34.1016 -118.3350
2 5607178 34.0838 -118.3614
In [152]:
# Loading the geopy package which can be used to calculate distance between coordinates
# The function which does this is geodesic()
import geopy
from geopy.distance import geodesic
In [153]:
%%time

# Loop through all stores and all arrests and calculate the distance
# This takes a while (more than 30 minutes)

for store in range(len(df_stores_dist)):
    
    # Get the data from that store
    store_name = df_stores_dist.iloc[store]['name']
    store_lat = df_stores_dist.iloc[store]['latitude']
    store_long = df_stores_dist.iloc[store]['longitude']
    
    # create new column on arrest_dist with store name
    df_arrests_dist[store_name] = np.NaN
    
    # Create an empty list to be filled with 
    # The distance between arrest and that given store
    distance_list = []
    
    # Loop though all arrests
    for arrest in range(len(df_arrests_dist)):
        
        # Get the geolocation from arrest
        arrest_lat = df_arrests_dist.iloc[arrest]['latitude']
        arrest_long = df_arrests_dist.iloc[arrest]['longitude']
        
        # Calculate distance from arrest to store
        distance = geodesic((store_lat, store_long), (arrest_lat, arrest_long)).miles
        
        # Add to the list containing all distances from a given store
        distance_list.append(distance)
        
    # Fill the store name column with the distances list
    df_arrests_dist[store_name] = distance_list
Wall time: 43min 5s
In [269]:
df_arrests_dist.to_csv('files/distances_dispensaries.csv')
In [154]:
df_arrests_dist
Out[154]:
Report ID latitude longitude California Caregivers Alliance Herbarium MedMen Los Angeles - DTLA Green Earth Collective HERB Kushfly The Higher Path ... Spectwonders Khonsu's Collective Cannacureshop Stiiizy Weed Delivery Trilogy Wellness The Blue Diamond Center Reefer Madness Clone Company Rolling Up Delivery Services OVO FINEST McLeonel dispensary
0 5568617 34.1016 -118.3350 3.846579 1.055725 6.065042 7.271511 6.300019 1.914766 6.853097 ... 15.634180 0.119404 1.549481 5.416763 43.626572 3.720157 33.563909 37.581619 39.645966 52.333731
1 5568629 34.1016 -118.3350 3.846579 1.055725 6.065042 7.271511 6.300019 1.914766 6.853097 ... 15.634180 0.119404 1.549481 5.416763 43.626572 3.720157 33.563909 37.581619 39.645966 52.333731
2 5607178 34.0838 -118.3614 5.125880 1.017099 6.704001 9.043892 6.956954 3.135336 6.375281 ... 15.416857 1.855468 0.600893 7.095336 45.160692 5.030097 34.207656 38.891827 41.101419 53.499184
3 5616516 33.9428 -118.4054 12.292510 10.633230 11.150758 16.679867 11.293960 13.176048 14.372352 ... 11.949733 11.623553 10.185298 17.071368 48.939045 15.068000 32.966616 41.323891 44.425795 54.636803
4 5616188 34.1667 -118.3965 9.203280 6.154616 11.711705 11.077266 11.935715 3.924004 2.758385 ... 21.287644 5.635323 6.305098 5.031280 47.351137 3.165782 39.014091 41.993391 43.651658 57.007198
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6221 5608841 33.9922 -118.3276 6.983746 6.710759 5.538437 11.207795 5.677312 9.397726 12.600659 ... 9.428619 7.557099 6.429712 12.799623 43.876664 11.266752 29.700995 36.623164 39.469040 50.470909
6222 5608817 33.9922 -118.3276 6.983746 6.710759 5.538437 11.207795 5.677312 9.397726 12.600659 ... 9.428619 7.557099 6.429712 12.799623 43.876664 11.266752 29.700995 36.623164 39.469040 50.470909
6223 5607957 33.9432 -118.4013 12.125538 10.532117 10.951188 16.502507 11.091994 13.091117 14.379368 ... 11.726920 11.520181 10.089127 16.970902 48.703420 14.986992 32.743496 41.086769 44.189353 54.402916
6224 5611605 33.9420 -118.4095 12.483114 10.764484 11.368913 16.879015 11.514012 13.291205 14.396604 ... 12.167033 11.756588 10.312090 17.200508 49.180934 15.178964 33.183789 41.564068 44.667513 54.869409
6225 5607182 34.0382 -118.2603 3.116618 5.951259 0.554649 6.468228 0.684713 7.908279 12.833999 ... 9.869101 6.202127 6.062438 9.995559 39.596138 9.353346 27.614863 32.808075 35.332462 47.159513

6226 rows × 294 columns

In [155]:
# Creating a new dataframe with only distance data
df_dist_arrest_store = df_arrests_dist.drop(['Report ID', 'latitude', 'longitude'], axis = 1)
In [156]:
df_dist_arrest_store.head(3)
Out[156]:
California Caregivers Alliance Herbarium MedMen Los Angeles - DTLA Green Earth Collective HERB Kushfly The Higher Path Cannabis Cafe MedMen Venice Beach - Abbot Kinney Exhale Med Center ... Spectwonders Khonsu's Collective Cannacureshop Stiiizy Weed Delivery Trilogy Wellness The Blue Diamond Center Reefer Madness Clone Company Rolling Up Delivery Services OVO FINEST McLeonel dispensary
0 3.846579 1.055725 6.065042 7.271511 6.300019 1.914766 6.853097 0.818004 10.785022 2.518391 ... 15.634180 0.119404 1.549481 5.416763 43.626572 3.720157 33.563909 37.581619 39.645966 52.333731
1 3.846579 1.055725 6.065042 7.271511 6.300019 1.914766 6.853097 0.818004 10.785022 2.518391 ... 15.634180 0.119404 1.549481 5.416763 43.626572 3.720157 33.563909 37.581619 39.645966 52.333731
2 5.125880 1.017099 6.704001 9.043892 6.956954 3.135336 6.375281 1.150360 8.849409 0.923056 ... 15.416857 1.855468 0.600893 7.095336 45.160692 5.030097 34.207656 38.891827 41.101419 53.499184

3 rows × 291 columns

In [157]:
# For each arrest count how many stores were in a 0.5 mile range
df_arrests_dist['store_0.5_mile'] = (df_dist_arrest_store < 0.5).sum(axis = 1)
In [158]:
# For each arrest count how many stores were in a 1 mile range
df_arrests_dist['store_1_mile'] = (df_dist_arrest_store < 1).sum(axis = 1)
In [159]:
# Problem prevention: Checking df_arrests and df_arrests_dist
# To see if the 'Report ID' columns still match
len(df_arrests['Report ID'].unique()) == len(df_arrests_dist['Report ID'].unique())
Out[159]:
True
In [160]:
# Adding the columns with the calculated distances back to df_arrests
df_arrests = df_arrests.merge(df_arrests_dist[['Report ID', 'store_0.5_mile', 'store_1_mile']],
                            left_on = 'Report ID',
                            right_on = 'Report ID')
In [161]:
df_arrests.head(2)
Out[161]:
Report ID Arrest Date Time Area ID Area Name Reporting District Age Sex Code Descent Code Charge Group Code ... Address Cross Street Location latitude longitude marijuana_related store_0.5_mile_x store_1_mile_x store_0.5_mile_y store_1_mile_y
0 5568617 2019-03-09 2015.0 6 Hollywood 646 29 M O 16.0 ... CHEROKEE HOLLYWOOD ['34.1016', '-118.335'] 34.1016 -118.3350 possession marijuana for sale 7 8 7 8
1 5568629 2019-03-09 2015.0 6 Hollywood 646 25 M B 16.0 ... CHEROKEE HOLLYWOOD ['34.1016', '-118.335'] 34.1016 -118.3350 possession marijuana for sale 7 8 7 8
2 5607178 2019-04-21 2115.0 7 Wilshire 702 42 M B 16.0 ... MELROSE AV FAIRFAX AV ['34.0838', '-118.3614'] 34.0838 -118.3614 poss/sale marij ovr 21 employ per 20/belw 4 13 4 13

3 rows × 24 columns

In [162]:
# Dropping the 'Unnamed: 0' column
df_arrests = df_arrests.drop('Unnamed: 0', 1)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-162-9369fa12cb49> in <module>
      1 # Dropping the 'Unnamed: 0' column
----> 2 df_arrests = df_arrests.drop('Unnamed: 0', 1)

~\Anaconda3\lib\site-packages\pandas\core\frame.py in drop(self, labels, axis, index, columns, level, inplace, errors)
   3995             level=level,
   3996             inplace=inplace,
-> 3997             errors=errors,
   3998         )
   3999 

~\Anaconda3\lib\site-packages\pandas\core\generic.py in drop(self, labels, axis, index, columns, level, inplace, errors)
   3934         for axis, labels in axes.items():
   3935             if labels is not None:
-> 3936                 obj = obj._drop_axis(labels, axis, level=level, errors=errors)
   3937 
   3938         if inplace:

~\Anaconda3\lib\site-packages\pandas\core\generic.py in _drop_axis(self, labels, axis, level, errors)
   3968                 new_axis = axis.drop(labels, level=level, errors=errors)
   3969             else:
-> 3970                 new_axis = axis.drop(labels, errors=errors)
   3971             result = self.reindex(**{axis_name: new_axis})
   3972 

~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in drop(self, labels, errors)
   5015         if mask.any():
   5016             if errors != "ignore":
-> 5017                 raise KeyError(f"{labels[mask]} not found in axis")
   5018             indexer = indexer[~mask]
   5019         return self.delete(indexer)

KeyError: "['Unnamed: 0'] not found in axis"
In [163]:
# Saving df_arrests to disk
df_arrests.to_csv('files/df_arrests_03.csv', index = False)

Distance Between Arrests and Schools

Here I'm looking at how many arrests happened near each school

AND at how many schools were near each arrest

In [164]:
# Creating a dataframe to calculate distances - stores
df_schools_dist = df_schools[['School', 'latitude', 'longitude']].dropna()
In [165]:
df_schools_dist.shape
Out[165]:
(906, 3)
In [166]:
df_schools.isnull().sum()
Out[166]:
School              0
Zip Code            0
Grades              0
Local District      0
Board District      0
complete_address    0
latitude            0
longitude           0
coordinates         0
dtype: int64
In [167]:
df_schools_dist
Out[167]:
School latitude longitude
0 Frank Del Olmo Elementary 34.072925 -118.292591
1 10th Street Elementary 34.048284 -118.273264
2 Elementary Community Day School 34.251825 -118.563377
3 Cesar E Chavez Learning Academy - Arts/Theatre... 34.288514 -118.421011
4 San Pedro Senior High Gifted STEAM Magnet 33.730370 -118.299342
... ... ... ...
901 Madison Elementary 33.945729 -118.214513
902 Stonehurst Avenue Elementary STEAM Magnet 34.249666 -118.369889
903 Mariposa-Nabi Primary Center 34.053092 -118.299215
904 99th Street Elementary 33.946302 -118.258281
905 Alfred B Nobel Charter Middle School 34.251085 -118.551358

906 rows × 3 columns

In [168]:
# Dataframe for distance between arrests and schools
df_dist_arrest_school = df_arrests[['Report ID', 'latitude', 'longitude']]
In [169]:
# The index is skipping rows, reseting it
df_dist_arrest_school.reset_index(drop = True, inplace = True)
In [170]:
df_dist_arrest_school
Out[170]:
School latitude longitude
0 Frank Del Olmo Elementary 34.072925 -118.292591
1 10th Street Elementary 34.048284 -118.273264
2 Elementary Community Day School 34.251825 -118.563377
3 Cesar E Chavez Learning Academy - Arts/Theatre... 34.288514 -118.421011
4 San Pedro Senior High Gifted STEAM Magnet 33.730370 -118.299342
... ... ... ...
901 Madison Elementary 33.945729 -118.214513
902 Stonehurst Avenue Elementary STEAM Magnet 34.249666 -118.369889
903 Mariposa-Nabi Primary Center 34.053092 -118.299215
904 99th Street Elementary 33.946302 -118.258281
905 Alfred B Nobel Charter Middle School 34.251085 -118.551358

906 rows × 3 columns

In [171]:
%%time

# Loop through all stores and all arrests and calculate the distance
# This takes a long time! (2 hours or so)

for store in range(len(df_schools_dist)):
    
    # Get the data from that store
    store_name = df_schools_dist.iloc[store]['School']
    store_lat = df_schools_dist.iloc[store]['latitude']
    store_long = df_schools_dist.iloc[store]['longitude']
    
    # create new column on arrest_dist with store name
    df_dist_arrest_school[store_name] = np.NaN
    
    # Create an empty list to be filled with 
    # The distance between arrest and that given store
    distance_list = []
    
    # Loop though all arrests
    for arrest in range(len(df_arrests_dist)):
        
        # Get the geolocation from arrest
        arrest_lat = df_arrests_dist.iloc[arrest]['latitude']
        arrest_long = df_arrests_dist.iloc[arrest]['longitude']
        
        # Calculate distance from arrest to store
        distance = geodesic((store_lat, store_long), (arrest_lat, arrest_long)).miles
        
        # Add to the list containing all distances from a given store
        distance_list.append(distance)
        
    # Fill the store name column with the distances list
    df_dist_arrest_school[store_name] = distance_list
Wall time: 1h 44min 27s
In [268]:
df_dist_arrest_school.to_csv('files/distances_schools.csv')
In [172]:
df_dist_arrest_school.shape
Out[172]:
(6226, 909)
In [285]:
df_dist_arrest_school.head(2)
Out[285]:
Report ID latitude longitude Frank Del Olmo Elementary 10th Street Elementary Elementary Community Day School Cesar E Chavez Learning Academy - Arts/Theatre/Entertain Mag San Pedro Senior High Gifted STEAM Magnet Normont Elementary Stoney Point Continuation High ... San Miguel Elementary DL World Lang Im Mandarin Wadsworth Avenue Elementary Topeka Drive Charter for Advanced Studies Madison Elementary Stonehurst Avenue Elementary STEAM Magnet Mariposa-Nabi Primary Center 99th Street Elementary Alfred B Nobel Charter Middle School arrest_0.5_mile arrest_1_mile
0 5568617 34.1016 -118.3350 3.133832 5.103094 16.684527 13.792930 25.668208 21.199880 17.826477 ... 13.236804 7.709947 15.750312 12.776177 10.399451 3.923088 11.573941 16.117688 2 5
1 5568629 34.1016 -118.3350 3.133832 5.103094 16.684527 13.792930 25.668208 21.199880 17.826477 ... 13.236804 7.709947 15.750312 12.776177 10.399451 3.923088 11.573941 16.117688 2 5
2 5607178 34.0838 -118.3614 4.016920 5.617011 16.371597 14.517362 24.619110 20.219608 17.412544 ... 13.263019 7.807277 15.490766 12.713676 11.442783 4.147576 11.173228 15.855267 4 8

3 rows × 911 columns

In [174]:
df_dist_arrest_school_2 = df_dist_arrest_school.drop(['Report ID', 'latitude', 'longitude'], axis = 1)
In [284]:
df_dist_arrest_school_2.head(2)
Out[284]:
Frank Del Olmo Elementary 10th Street Elementary Elementary Community Day School Cesar E Chavez Learning Academy - Arts/Theatre/Entertain Mag San Pedro Senior High Gifted STEAM Magnet Normont Elementary Stoney Point Continuation High Montara Avenue Elementary Chatsworth Charter High School Our Community Charter School ... Beachy Avenue Elementary Hobart Boulevard Elementary DL Two-Way Im Spanish San Miguel Elementary DL World Lang Im Mandarin Wadsworth Avenue Elementary Topeka Drive Charter for Advanced Studies Madison Elementary Stonehurst Avenue Elementary STEAM Magnet Mariposa-Nabi Primary Center 99th Street Elementary Alfred B Nobel Charter Middle School
0 3.133832 5.103094 16.684527 13.792930 25.668208 21.199880 17.826477 12.742390 17.786585 16.745153 ... 11.675055 3.729257 13.236804 7.709947 15.750312 12.776177 10.399451 3.923088 11.573941 16.117688
1 3.133832 5.103094 16.684527 13.792930 25.668208 21.199880 17.826477 12.742390 17.786585 16.745153 ... 11.675055 3.729257 13.236804 7.709947 15.750312 12.776177 10.399451 3.923088 11.573941 16.117688
2 4.016920 5.617011 16.371597 14.517362 24.619110 20.219608 17.412544 12.643768 17.388639 16.434939 ... 12.121288 3.815014 13.263019 7.807277 15.490766 12.713676 11.442783 4.147576 11.173228 15.855267

3 rows × 906 columns

In [178]:
# For each ARREST count how many SCHOOLS were in a 0.5 mile range
df_dist_arrest_school['school_0.5_mile'] = (df_dist_arrest_school_2 < 0.5).sum(axis = 1)
In [179]:
# For each ARREST count how many SCHOOLS were in a 1 mile range
df_dist_arrest_school['school_1_mile'] = (df_dist_arrest_school_2 < 1).sum(axis = 1)
In [283]:
df_dist_arrest_school.head(2)
Out[283]:
Report ID latitude longitude Frank Del Olmo Elementary 10th Street Elementary Elementary Community Day School Cesar E Chavez Learning Academy - Arts/Theatre/Entertain Mag San Pedro Senior High Gifted STEAM Magnet Normont Elementary Stoney Point Continuation High ... San Miguel Elementary DL World Lang Im Mandarin Wadsworth Avenue Elementary Topeka Drive Charter for Advanced Studies Madison Elementary Stonehurst Avenue Elementary STEAM Magnet Mariposa-Nabi Primary Center 99th Street Elementary Alfred B Nobel Charter Middle School arrest_0.5_mile arrest_1_mile
0 5568617 34.1016 -118.3350 3.133832 5.103094 16.684527 13.792930 25.668208 21.199880 17.826477 ... 13.236804 7.709947 15.750312 12.776177 10.399451 3.923088 11.573941 16.117688 2 5
1 5568629 34.1016 -118.3350 3.133832 5.103094 16.684527 13.792930 25.668208 21.199880 17.826477 ... 13.236804 7.709947 15.750312 12.776177 10.399451 3.923088 11.573941 16.117688 2 5
2 5607178 34.0838 -118.3614 4.016920 5.617011 16.371597 14.517362 24.619110 20.219608 17.412544 ... 13.263019 7.807277 15.490766 12.713676 11.442783 4.147576 11.173228 15.855267 4 8

3 rows × 911 columns

In [181]:
# Now transposing the original distances dataframe in order to count the number of arrests near each school
df_dist_school_arrest = df_dist_arrest_school_2.T
In [184]:
# For each SCHOOL count how many ARREST happened in a 0.5 mile range
df_dist_school_arrest['arrest_0.5_mile'] = (df_dist_school_arrest < 0.5).sum(axis = 1)
In [185]:
# For each SCHOOL count how many ARREST happened in a 1 mile range
df_dist_school_arrest['arrest_1_mile'] = (df_dist_school_arrest < 1).sum(axis = 1)
In [282]:
df_dist_school_arrest.head(2)
Out[282]:
School 0 1 2 3 4 5 6 7 8 ... 6218 6219 6220 6221 6222 6223 6224 6225 arrest_0.5_mile arrest_1_mile
0 Frank Del Olmo Elementary 3.133832 3.133832 4.016920 11.061959 8.789582 11.094973 11.061959 11.059560 10.903239 ... 10.961893 10.961893 17.092165 5.915487 5.915487 10.903239 11.245559 3.026553 28 95
1 10th Street Elementary 5.103094 5.103094 5.617011 10.507209 10.795187 10.558389 10.507209 10.529555 10.319182 ... 10.428453 10.428453 18.882616 4.966615 4.966615 10.319182 10.716023 1.018030 26 141
2 Elementary Community Day School 16.684527 16.684527 16.371597 23.145724 11.213558 23.073752 23.145724 23.027052 23.213462 ... 23.001286 23.001286 4.527672 22.425167 22.425167 23.213462 23.105700 22.769924 1 3

3 rows × 6229 columns

In [193]:
df_dist_school_arrest.reset_index(inplace = True)
In [205]:
df_dist_school_arrest.rename(columns = {'index': 'School'}, inplace = True)
In [206]:
# Problem prevention: Checking df_arrests and df_arrests_dist
# To see if the 'Report ID' columns still match
len(df_schools['School'].unique()) == len(df_dist_school_arrest['School'].unique())
Out[206]:
True
In [209]:
# Adding two columns to the schools dataframe
# With the tally of how many arrests happened nearby
df_schools = df_schools.merge(df_dist_school_arrest[['School', 'arrest_0.5_mile', 'arrest_1_mile']],
                            left_on = 'School',
                            right_on = 'School')
In [210]:
df_schools.head(2)
Out[210]:
School Zip Code Grades Local District Board District complete_address latitude longitude coordinates arrest_0.5_mile arrest_1_mile
0 Frank Del Olmo Elementary 90004 K- 5 C 2-Monica Garcia 100 N New Hampshire Ave Los Angeles CA 90004 34.072925 -118.292591 34.072925, -118.292591 28 95
1 10th Street Elementary 90015 1- 5 C 2-Monica Garcia 1000 Grattan St Los Angeles CA 90015 34.048284 -118.273264 34.0482844, -118.273264 26 141
2 Elementary Community Day School 91311 K- 6 XS 3-Scott M Schmerelson 10001 Jumilla Ave Chatsworth CA 91311 34.251825 -118.563377 34.2518249, -118.5633769 1 3
In [211]:
df_arrests.head(2)
Out[211]:
Report ID Arrest Date Time Area ID Area Name Reporting District Age Sex Code Descent Code Charge Group Code ... Address Cross Street Location latitude longitude marijuana_related store_0.5_mile_x store_1_mile_x store_0.5_mile_y store_1_mile_y
0 5568617 2019-03-09 2015.0 6 Hollywood 646 29 M O 16.0 ... CHEROKEE HOLLYWOOD ['34.1016', '-118.335'] 34.1016 -118.3350 possession marijuana for sale 7 8 7 8
1 5568629 2019-03-09 2015.0 6 Hollywood 646 25 M B 16.0 ... CHEROKEE HOLLYWOOD ['34.1016', '-118.335'] 34.1016 -118.3350 possession marijuana for sale 7 8 7 8
2 5607178 2019-04-21 2115.0 7 Wilshire 702 42 M B 16.0 ... MELROSE AV FAIRFAX AV ['34.0838', '-118.3614'] 34.0838 -118.3614 poss/sale marij ovr 21 employ per 20/belw 4 13 4 13

3 rows × 24 columns

In [213]:
# Adding two columns to the arrests dataframe
# With the tally of how many schools were nearby
df_arrests['school_0.5_mile'] = (df_dist_arrest_school_2 < 0.5).sum(axis =1)
df_arrests['school_1_mile'] = (df_dist_arrest_school_2 < 1).sum(axis =1)
In [286]:
df_arrests.head(2)
Out[286]:
Report ID Arrest Date Time Area ID Area Name Reporting District Age Sex Code Descent Code Charge Group Code ... Location latitude longitude marijuana_related store_0.5_mile_x store_1_mile_x store_0.5_mile_y store_1_mile_y school_0.5_mile school_1_mile
0 5568617 2019-03-09 2015.0 6 Hollywood 646 29 M O 16.0 ... ['34.1016', '-118.335'] 34.1016 -118.335 possession marijuana for sale 7 8 7 8 2 5
1 5568629 2019-03-09 2015.0 6 Hollywood 646 25 M B 16.0 ... ['34.1016', '-118.335'] 34.1016 -118.335 possession marijuana for sale 7 8 7 8 2 5

2 rows × 26 columns

In [215]:
df_arrests.columns
Out[215]:
Index(['Report ID', 'Arrest Date', 'Time', 'Area ID', 'Area Name',
       'Reporting District', 'Age', 'Sex Code', 'Descent Code',
       'Charge Group Code', 'Charge Group Description', 'Arrest Type Code',
       'Charge', 'Charge Description', 'Address', 'Cross Street', 'Location',
       'latitude', 'longitude', 'marijuana_related', 'store_0.5_mile_x',
       'store_1_mile_x', 'store_0.5_mile_y', 'store_1_mile_y',
       'school_0.5_mile', 'school_1_mile'],
      dtype='object')
In [216]:
# Saving to disk all dataframes
df_arrests.to_csv('files/df_arrests_04.csv', index = False)
df_schools.to_csv('files/df_schools_03.csv' ,index = False)

Arrest Time Series

In [217]:
# Range of dates for indexing
index_dates = pd.date_range('2014-01-01', '2019-12-31')
In [218]:
df_arrests
Out[218]:
Report ID Arrest Date Time Area ID Area Name Reporting District Age Sex Code Descent Code Charge Group Code ... Location latitude longitude marijuana_related store_0.5_mile_x store_1_mile_x store_0.5_mile_y store_1_mile_y school_0.5_mile school_1_mile
0 5568617 2019-03-09 2015.0 6 Hollywood 646 29 M O 16.0 ... ['34.1016', '-118.335'] 34.1016 -118.3350 possession marijuana for sale 7 8 7 8 2 5
1 5568629 2019-03-09 2015.0 6 Hollywood 646 25 M B 16.0 ... ['34.1016', '-118.335'] 34.1016 -118.3350 possession marijuana for sale 7 8 7 8 2 5
2 5607178 2019-04-21 2115.0 7 Wilshire 702 42 M B 16.0 ... ['34.0838', '-118.3614'] 34.0838 -118.3614 poss/sale marij ovr 21 employ per 20/belw 4 13 4 13 4 8
3 5616516 2019-05-01 1045.0 14 Pacific 1494 26 M B 16.0 ... ['33.9428', '-118.4054'] 33.9428 -118.4054 sale/trans >28.5g marijuana or >4g 0 0 0 0 0 0
4 5616188 2019-05-01 5.0 15 N Hollywood 1543 27 M B 16.0 ... ['34.1667', '-118.3965'] 34.1667 -118.3965 sale/trans >28.5g marijuana or >4g 0 1 0 1 3 4
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6221 5608841 2019-04-23 1020.0 12 77th Street 1211 57 M B 16.0 ... ['33.9922', '-118.3276'] 33.9922 -118.3276 cultivating <6 marijuana plants 2 4 2 4 7 11
6222 5608817 2019-04-23 1020.0 12 77th Street 1211 57 M B 16.0 ... ['33.9922', '-118.3276'] 33.9922 -118.3276 cultivating <6 marijuana plants 2 4 2 4 7 11
6223 5607957 2019-04-22 1531.0 14 Pacific 1494 36 M O 16.0 ... ['33.9432', '-118.4013'] 33.9432 -118.4013 transport/sell/furnish/etc marijuana 0 0 0 0 0 0
6224 5611605 2019-04-26 1130.0 14 Pacific 1494 22 M B 16.0 ... ['33.942', '-118.4095'] 33.9420 -118.4095 transport/sell/furnish/etc marijuana 0 0 0 0 0 0
6225 5607182 2019-04-21 2140.0 1 Central 185 19 M H 16.0 ... ['34.0382', '-118.2603'] 34.0382 -118.2603 possess marijuana for sale 10 17 10 17 0 10

6226 rows × 26 columns

In [219]:
# Dataframe with arrests grouped per day (categorial variables will disappear)
df_arrests_g = df_arrests.groupby('Arrest Date').sum()
In [220]:
# Tally of arrest near schools per day
df_arrests_g['school_0.5_mile'].head()
Out[220]:
Arrest Date
2014-01-01    14
2014-01-02    15
2014-01-03     1
2014-01-04     8
2014-01-05     5
Name: school_0.5_mile, dtype: int64
In [221]:
# Tally of arrest near schools per day
df_arrests_g['school_1_mile'].head()
Out[221]:
Arrest Date
2014-01-01    39
2014-01-02    48
2014-01-03    10
2014-01-04    32
2014-01-05    19
Name: school_1_mile, dtype: int64
In [222]:
# Making dataframes
df_arrests_half_mile_schools = pd.DataFrame(df_arrests_g['school_0.5_mile'])
df_arrests_one_mile_schools = pd.DataFrame(df_arrests_g['school_1_mile'])
In [225]:
df_arrests_half_mile_schools.head()
Out[225]:
school_0.5_mile
Arrest Date
2014-01-01 14
2014-01-02 15
2014-01-03 1
2014-01-04 8
2014-01-05 5
In [226]:
df_arrests_one_mile_schools.head()
Out[226]:
school_1_mile
Arrest Date
2014-01-01 39
2014-01-02 48
2014-01-03 10
2014-01-04 32
2014-01-05 19

The dataframe index is of type: Index

To be able to tally the monthly totals, it must be of type: DateTimeIndex

In [227]:
# Checking index type
df_arrests_half_mile_schools.index
Out[227]:
Index(['2014-01-01', '2014-01-02', '2014-01-03', '2014-01-04', '2014-01-05',
       '2014-01-06', '2014-01-07', '2014-01-08', '2014-01-09', '2014-01-10',
       ...
       '2019-12-15', '2019-12-16', '2019-12-17', '2019-12-18', '2019-12-19',
       '2019-12-20', '2019-12-22', '2019-12-26', '2019-12-28', '2019-12-29'],
      dtype='object', name='Arrest Date', length=1691)
In [228]:
# Converting index type
df_arrests_half_mile_schools.index = pd.to_datetime(df_arrests_half_mile_schools.index)
df_arrests_one_mile_schools.index = pd.to_datetime(df_arrests_one_mile_schools.index)
In [229]:
# Using resampling to tally monthly data
df_arrests_half_mile_schools.resample('M').sum().head()
Out[229]:
school_0.5_mile
Arrest Date
2014-01-31 381
2014-02-28 302
2014-03-31 332
2014-04-30 255
2014-05-31 263

Timeseries of Arrests Near Schools

In [230]:
# Number of arrests with a school in a 0.5-mile area

# Figure
plt.figure(figsize = (20,10))

# Plot data
plt.plot(df_arrests_half_mile_schools.resample('M').sum(), color = 'blue', linewidth = 3)

# Labels and Legend
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.title('\nNumber of Arrests with Schools on a 0.5-Mile Range\n', fontsize = 30)
plt.xlabel('\nMonth of Year')
plt.ylabel('\nNumber of Arrests')
plt.tight_layout()
In [231]:
# Number of arrests with a school in a 0.5-mile area

# Figure
plt.figure(figsize = (20,10))

# Plot data
plt.plot(df_arrests_one_mile_schools.resample('M').sum(), color = 'blue', linewidth = 3)

# Labels and Legend
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.title('\nNumber of Arrests with Schools on a 1-Mile Range\n', fontsize = 30)
plt.xlabel('\nMonth of Year')
plt.ylabel('\nNumber of Arrests')
plt.tight_layout()

Something changed in 2017 that caused arrests near schools to drop. Was it marijuana legalization? I cannot be affirmed with total certainty, but everything thus far points on that direction. One important question is: Have the 'crimes' been reduced, or was this simply a change in how they are accounted for (what used to be a crime no longer is)?

Top Reasons for Arrest

In [232]:
# Function to plot the tally of arrests per crime on each year
def plot_total_crime_type(dataframe, year):
    
    # Filtering the dataframe by type of crime
    # Counting the number of occurances per type
    # Rearrange in ascending order
    # Create a bar plot
    ax = dataframe['Charge Description'].value_counts(ascending = True).tail().plot(kind = 'bar',
                                                                                    figsize = (16,10),
                                                                                    color = 'magenta',
                                                                                    alpha = 0.7,
                                                                                    rot = 30)
    
    ax.set_title(f'\nTally of Arrests per Crime Type in {year}\n', fontsize = 30)
    ax.set_ylabel('\nCrime Type\n', fontsize = 18);
    ax.set_xlabel('\nCrime Tally\n', fontsize = 18)
In [233]:
# Before plotting the data there is an issue to fix on the dataset
# There are two entries for "possess 28.5 grams or less of marijuana"
# With the second one having an extra "**"
# Merging both those
df_arrests['Charge Description'] = df_arrests['Charge Description'].map(lambda x: 'possess 28.5 grams or less of marijuana' if x == 'possess 28.5 grams or less of marijuana**' else x)
In [234]:
# Creating new dataframes per year
df_arrests_2014 = df_arrests[(df_arrests['Arrest Date'] <= '2014-12-31') & (df_arrests['Arrest Date'] >= '2014-01-01')]
df_arrests_2015 = df_arrests[(df_arrests['Arrest Date'] <= '2015-12-31') & (df_arrests['Arrest Date'] >= '2015-01-01')]
df_arrests_2016 = df_arrests[(df_arrests['Arrest Date'] <= '2016-12-31') & (df_arrests['Arrest Date'] >= '2016-01-01')]
df_arrests_2017 = df_arrests[(df_arrests['Arrest Date'] <= '2017-12-31') & (df_arrests['Arrest Date'] >= '2017-01-01')]
df_arrests_2018 = df_arrests[(df_arrests['Arrest Date'] <= '2018-12-31') & (df_arrests['Arrest Date'] >= '2018-01-01')]
df_arrests_2019 = df_arrests[(df_arrests['Arrest Date'] <= '2019-12-31') & (df_arrests['Arrest Date'] >= '2019-01-01')]
In [235]:
# Total Arrests per Crime in 2014
plot_total_crime_type(df_arrests_2014, 2014)
In [241]:
# Total Arrests per Crime in 2015
plot_total_crime_type(df_arrests_2015, 2015)
In [242]:
# Total Arrests per Crime in 2016
plot_total_crime_type(df_arrests_2016, 2016)
In [243]:
# Total Arrests per Crime in 2017
plot_total_crime_type(df_arrests_2017, 2017)
In [244]:
# Total Arrests per Crime in 2018
plot_total_crime_type(df_arrests_2018, 2018)
In [245]:
# Total Arrests per Crime in 2019
plot_total_crime_type(df_arrests_2019, 2019)

Statistical Modeling

In [246]:
df_arrests.head()
Out[246]:
Report ID Arrest Date Time Area ID Area Name Reporting District Age Sex Code Descent Code Charge Group Code ... Location latitude longitude marijuana_related store_0.5_mile_x store_1_mile_x store_0.5_mile_y store_1_mile_y school_0.5_mile school_1_mile
0 5568617 2019-03-09 2015.0 6 Hollywood 646 29 M O 16.0 ... ['34.1016', '-118.335'] 34.1016 -118.3350 possession marijuana for sale 7 8 7 8 2 5
1 5568629 2019-03-09 2015.0 6 Hollywood 646 25 M B 16.0 ... ['34.1016', '-118.335'] 34.1016 -118.3350 possession marijuana for sale 7 8 7 8 2 5
2 5607178 2019-04-21 2115.0 7 Wilshire 702 42 M B 16.0 ... ['34.0838', '-118.3614'] 34.0838 -118.3614 poss/sale marij ovr 21 employ per 20/belw 4 13 4 13 4 8
3 5616516 2019-05-01 1045.0 14 Pacific 1494 26 M B 16.0 ... ['33.9428', '-118.4054'] 33.9428 -118.4054 sale/trans >28.5g marijuana or >4g 0 0 0 0 0 0
4 5616188 2019-05-01 5.0 15 N Hollywood 1543 27 M B 16.0 ... ['34.1667', '-118.3965'] 34.1667 -118.3965 sale/trans >28.5g marijuana or >4g 0 1 0 1 3 4

5 rows × 26 columns

In [247]:
df_arrests.describe().T
Out[247]:
count mean std min 25% 50% 75% max
Report ID 6226.0 4.834485e+07 6.881528e+07 2586.0000 4.350713e+06 4.811214e+06 1.410136e+08 2.006068e+08
Time 6226.0 1.510446e+03 5.321543e+02 1.0000 1.255000e+03 1.600000e+03 1.900000e+03 2.358000e+03
Area ID 6226.0 1.100594e+01 5.731738e+00 1.0000 6.000000e+00 1.200000e+01 1.500000e+01 2.100000e+01
Reporting District 6226.0 1.149005e+03 5.727965e+02 101.0000 6.460000e+02 1.235000e+03 1.546750e+03 2.197000e+03
Age 6226.0 3.049213e+01 1.134049e+01 12.0000 2.200000e+01 2.700000e+01 3.600000e+01 7.800000e+01
Charge Group Code 6226.0 1.600000e+01 0.000000e+00 16.0000 1.600000e+01 1.600000e+01 1.600000e+01 1.600000e+01
latitude 6226.0 3.406042e+01 1.145151e-01 33.7096 3.398750e+01 3.404770e+01 3.413495e+01 3.432860e+01
longitude 6226.0 -1.183571e+02 9.521823e-02 -118.6574 -1.184335e+02 -1.183337e+02 -1.182783e+02 -1.181624e+02
store_0.5_mile_x 6226.0 1.567298e+00 2.293777e+00 0.0000 0.000000e+00 0.000000e+00 2.000000e+00 1.100000e+01
store_1_mile_x 6226.0 4.098458e+00 4.885166e+00 0.0000 0.000000e+00 2.000000e+00 6.000000e+00 2.600000e+01
store_0.5_mile_y 6226.0 1.567298e+00 2.293777e+00 0.0000 0.000000e+00 0.000000e+00 2.000000e+00 1.100000e+01
store_1_mile_y 6226.0 4.098458e+00 4.885166e+00 0.0000 0.000000e+00 2.000000e+00 6.000000e+00 2.600000e+01
school_0.5_mile 6226.0 2.071314e+00 1.851271e+00 0.0000 1.000000e+00 2.000000e+00 3.000000e+00 1.100000e+01
school_1_mile 6226.0 7.819949e+00 5.686667e+00 0.0000 4.000000e+00 7.000000e+00 1.100000e+01 3.300000e+01
In [248]:
df_arrests.dtypes
Out[248]:
Report ID                     int64
Arrest Date                  object
Time                        float64
Area ID                       int64
Area Name                    object
Reporting District            int64
Age                           int64
Sex Code                     object
Descent Code                 object
Charge Group Code           float64
Charge Group Description     object
Arrest Type Code             object
Charge                       object
Charge Description           object
Address                      object
Cross Street                 object
Location                     object
latitude                    float64
longitude                   float64
marijuana_related            object
store_0.5_mile_x              int64
store_1_mile_x                int64
store_0.5_mile_y              int64
store_1_mile_y                int64
school_0.5_mile               int64
school_1_mile                 int64
dtype: object
In [253]:
df_arrests.nunique()
Out[253]:
Report ID                   6226
Arrest Date                 1691
Time                         430
Area ID                       21
Area Name                     21
Reporting District           904
Age                           66
Sex Code                       2
Descent Code                  10
Charge Group Code              1
Charge Group Description       1
Arrest Type Code               3
Charge                        31
Charge Description            31
Address                     2721
Cross Street                1142
Location                    3111
latitude                    1737
longitude                   1654
marijuana_related             32
store_0.5_mile_x              12
store_1_mile_x                26
store_0.5_mile_y              12
store_1_mile_y                26
school_0.5_mile               12
school_1_mile                 34
dtype: int64
In [254]:
df_arrests_vars = df_arrests.describe().columns
In [258]:
# Histograms

# Figure
fig = plt.figure(figsize = (32,16))

# Plot a histogram of each variable
for i, col in enumerate(df_arrests_vars):
    fig.add_subplot(4,4,1+i)
    col_data = df_arrests[col]
    plt.hist(col_data, color = 'khaki')
    plt.title(col)

Looks like none of the variables follows a normal distribution. This can be later checked with am appropriate test!

In [260]:
# Now moving to the schools dataframe
df_schools.describe()
Out[260]:
Zip Code latitude longitude arrest_0.5_mile arrest_1_mile
count 906.000000 906.000000 906.000000 906.000000 906.000000
mean 90475.435982 34.060198 -118.334404 14.233996 53.929360
std 587.265043 0.128336 0.116058 28.174758 68.352301
min 90001.000000 33.711431 -118.654393 0.000000 1.000000
25% 90026.000000 33.981266 -118.422510 1.000000 12.000000
50% 90063.000000 34.044304 -118.298979 7.000000 32.000000
75% 91306.750000 34.168790 -118.251660 19.000000 73.750000
max 91754.000000 34.319027 -118.152239 552.000000 677.000000
In [261]:
df_schools_vars = ['arrest_0.5_mile', 'arrest_1_mile']
In [263]:
# Histograms

# Figure
fig = plt.figure(figsize = (16,8))

# Plot a histogram of each variable
for i, col in enumerate(df_schools_vars):
    fig.add_subplot(2,4,1+i)
    col_data = df_schools[col]
    plt.hist(col_data, color = 'green')
    plt.title(col)

Again doesn't seem to be normally distributed.

Normality Test

Quick Review: The p-value is the probability of obtaining a test statistic at least as extreme as the one observed when the null hypothesis is true. The null hypothesis (H0) affirms that the population is normally distributed, against the alternative hypothesis (H1) that it is not normally distributed.

In [264]:
# Function to test variable normality
def test_normality(dataframe, list_of_cols, sig_level = 0.05):
    
    # List columns on dataframe
    total_variables = len(list_of_cols)
    
    # Start the counter of varibles which are 'non normal'
    not_normal_variables = 0
    
    # Loop through each column
    for col in list_of_cols:
        
        # Run the normality test (from statsmodels)
        p_val = stats.normaltest(dataframe[col])[1]
        
        # Significance Level
        sig_lvl = 0.05
        
        # Check p-value
        if p_val < sig_lvl:
            print(f'{col}')
            print(f'P-value: {p_val}\nSignificance Level: {sig_lvl}\n')
            print("The p-value is lower than the Significance Level, hence the null hypothesis is rejected and we conclude that the variable is not normally distributed.\n")
            not_normal_variables += 1
        elif p_val > sig_lvl:
            print(f'{col}')
            print(f'P-value: {p_val}\nSignificance Level: {sig_lvl}\n')
            print("The p-value is greater than the Significance Level, hence we fail to reject the null hypothesis that the variable is normally distributed.\n")
    
    if not_normal_variables > 0:
        print(f'{not_normal_variables} variables failed the normality test')
In [265]:
# Test normality for arrests
test_normality(df_arrests, df_arrests_vars)
Report ID
P-value: 0.0
Significance Level: 0.05

The p-value is lower than the Significance Level, hence the null hypothesis is rejected and we conclude that the variable is not normally distributed.

Time
P-value: 3.7075045947738646e-165
Significance Level: 0.05

The p-value is lower than the Significance Level, hence the null hypothesis is rejected and we conclude that the variable is not normally distributed.

Area ID
P-value: 0.0
Significance Level: 0.05

The p-value is lower than the Significance Level, hence the null hypothesis is rejected and we conclude that the variable is not normally distributed.

Reporting District
P-value: 0.0
Significance Level: 0.05

The p-value is lower than the Significance Level, hence the null hypothesis is rejected and we conclude that the variable is not normally distributed.

Age
P-value: 1.1205894980358965e-190
Significance Level: 0.05

The p-value is lower than the Significance Level, hence the null hypothesis is rejected and we conclude that the variable is not normally distributed.

Charge Group Code
P-value: 0.0
Significance Level: 0.05

The p-value is lower than the Significance Level, hence the null hypothesis is rejected and we conclude that the variable is not normally distributed.

latitude
P-value: 8.438890053160126e-14
Significance Level: 0.05

The p-value is lower than the Significance Level, hence the null hypothesis is rejected and we conclude that the variable is not normally distributed.

longitude
P-value: 2.5496285084474583e-78
Significance Level: 0.05

The p-value is lower than the Significance Level, hence the null hypothesis is rejected and we conclude that the variable is not normally distributed.

store_0.5_mile_x
P-value: 0.0
Significance Level: 0.05

The p-value is lower than the Significance Level, hence the null hypothesis is rejected and we conclude that the variable is not normally distributed.

store_1_mile_x
P-value: 0.0
Significance Level: 0.05

The p-value is lower than the Significance Level, hence the null hypothesis is rejected and we conclude that the variable is not normally distributed.

store_0.5_mile_y
P-value: 0.0
Significance Level: 0.05

The p-value is lower than the Significance Level, hence the null hypothesis is rejected and we conclude that the variable is not normally distributed.

store_1_mile_y
P-value: 0.0
Significance Level: 0.05

The p-value is lower than the Significance Level, hence the null hypothesis is rejected and we conclude that the variable is not normally distributed.

school_0.5_mile
P-value: 1.8728095723343848e-299
Significance Level: 0.05

The p-value is lower than the Significance Level, hence the null hypothesis is rejected and we conclude that the variable is not normally distributed.

school_1_mile
P-value: 1.0155192688748858e-236
Significance Level: 0.05

The p-value is lower than the Significance Level, hence the null hypothesis is rejected and we conclude that the variable is not normally distributed.

14 variables failed the normality test
In [266]:
# Test normality for schools
test_normality(df_schools, df_schools_vars)
arrest_0.5_mile
P-value: 0.0
Significance Level: 0.05

The p-value is lower than the Significance Level, hence the null hypothesis is rejected and we conclude that the variable is not normally distributed.

arrest_1_mile
P-value: 2.8884711175727554e-161
Significance Level: 0.05

The p-value is lower than the Significance Level, hence the null hypothesis is rejected and we conclude that the variable is not normally distributed.

2 variables failed the normality test

Conclusion: None of the variables is normally distributed. This means that much caution is needed when creating the predictive models.

Final Conclusion

Clearly there seems to be a pattern between number of arrests and marijuana legalization. The proximity between arrests and dispensaries or schools doesn't seem to change from year to year, which seems to imply that there is no correlation between them. The important question to answer now is whether is there is some other factor which could better explain the drop in marijuana related crimes.

Marijuana_Crime_03

Marijuana_Crime_03

Tasks for part 3:

  • Predictive modeling
  • Time series analysis
  • Conclusions
In [1]:
# Disabling the multiple messages generated by the new versions from Pandas and Matplotlib
import sys
import warnings
import matplotlib.cbook
if not sys.warnoptions:
    warnings.simplefilter("ignore")
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=matplotlib.cbook.mplDeprecation)

# Data manipulation and visualization
import math
import numpy as np
import pandas as pd
import matplotlib as m
import matplotlib.pyplot as plt
from scipy import stats

# Machine Learning
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler

# Time series analysis
import pickle
import calendar
import statsmodels
from statsmodels.tsa.arima_model import ARIMA, ARMA
from sklearn.metrics import mean_squared_error
from statsmodels.tsa.stattools import adfuller 
from statsmodels.tsa.stattools import acf, pacf

# Graphics formatting
m.rcParams['axes.labelsize'] = 14
m.rcParams['xtick.labelsize'] = 12
m.rcParams['ytick.labelsize'] = 12
from matplotlib.pylab import rcParams 
rcParams['figure.figsize'] = 15,7
plt.style.use('fivethirtyeight')
%matplotlib inline

Predictive Modeling

In [2]:
# Loading data
df_arrests = pd.read_csv('files/df_arrests_04.csv')
In [3]:
df_arrests.shape
Out[3]:
(6226, 26)
In [4]:
df_arrests.head()
Out[4]:
Report ID Arrest Date Time Area ID Area Name Reporting District Age Sex Code Descent Code Charge Group Code ... Location latitude longitude marijuana_related store_0.5_mile_x store_1_mile_x store_0.5_mile_y store_1_mile_y school_0.5_mile school_1_mile
0 5568617 2019-03-09 2015.0 6 Hollywood 646 29 M O 16.0 ... ['34.1016', '-118.335'] 34.1016 -118.3350 possession marijuana for sale 7 8 7 8 2 5
1 5568629 2019-03-09 2015.0 6 Hollywood 646 25 M B 16.0 ... ['34.1016', '-118.335'] 34.1016 -118.3350 possession marijuana for sale 7 8 7 8 2 5
2 5607178 2019-04-21 2115.0 7 Wilshire 702 42 M B 16.0 ... ['34.0838', '-118.3614'] 34.0838 -118.3614 poss/sale marij ovr 21 employ per 20/belw 4 13 4 13 4 8
3 5616516 2019-05-01 1045.0 14 Pacific 1494 26 M B 16.0 ... ['33.9428', '-118.4054'] 33.9428 -118.4054 sale/trans >28.5g marijuana or >4g 0 0 0 0 0 0
4 5616188 2019-05-01 5.0 15 N Hollywood 1543 27 M B 16.0 ... ['34.1667', '-118.3965'] 34.1667 -118.3965 sale/trans >28.5g marijuana or >4g 0 1 0 1 3 4

5 rows × 26 columns

In [7]:
df_arrests.drop(labels = ['store_0.5_mile_y', 'store_1_mile_y'], axis = 1, inplace = True)
In [12]:
df_arrests.rename(axis = 1, mapper = {'store_0.5_mile_x' : 'store_0.5_mile', 'store_1_mile_x' : 'store_1_mile'}, inplace = True)
In [13]:
df_arrests.head(3)
Out[13]:
Report ID Arrest Date Time Area ID Area Name Reporting District Age Sex Code Descent Code Charge Group Code ... Address Cross Street Location latitude longitude marijuana_related store_0.5_mile store_1_mile school_0.5_mile school_1_mile
0 5568617 2019-03-09 2015.0 6 Hollywood 646 29 M O 16.0 ... CHEROKEE HOLLYWOOD ['34.1016', '-118.335'] 34.1016 -118.3350 possession marijuana for sale 7 8 2 5
1 5568629 2019-03-09 2015.0 6 Hollywood 646 25 M B 16.0 ... CHEROKEE HOLLYWOOD ['34.1016', '-118.335'] 34.1016 -118.3350 possession marijuana for sale 7 8 2 5
2 5607178 2019-04-21 2115.0 7 Wilshire 702 42 M B 16.0 ... MELROSE AV FAIRFAX AV ['34.0838', '-118.3614'] 34.0838 -118.3614 poss/sale marij ovr 21 employ per 20/belw 4 13 4 8

3 rows × 24 columns

In [14]:
# Listing all charge descriptions
charge_description_list = list(df_arrests['Charge Description'].unique())
In [15]:
# Ordering the list
charge_description_list.sort()
In [16]:
charge_description_list
Out[16]:
['attempt - sell/furnish/etc marijuana',
 'cultivate >6 marij plants viol envrnt law',
 'cultivating <6 marijuana plants',
 'furnishing marijuana to minor over 14 yrs',
 'give/transport/etc < 28.5 grams marijuana',
 'induce/etc minor to use/sell marijuana',
 'minor poss 28.5+ grams marijuana/school',
 'minor poss < 28.5 grams marijuana/school',
 'poss for sale of marijuana to a minor',
 'poss marijuana or concentrated cannabis',
 'poss of more than 28.5 grams of marijuana',
 'poss open cont/packg marij drivr/passnger',
 'poss/sale marij ovr 21 employ per 20/belw',
 'poss/smoke/ingest marij school/daycare/yc',
 'possess 28.5 grams or less of marijuana',
 'possess 28.5 grams or less of marijuana**',
 'possess marijuana for sale',
 'possess marijuana for sale under age 18',
 'possess of marijuana while driving veh',
 'possession marijuana for sale',
 'possession of marijuana in school',
 'sale/offer to sell/transport marijuana',
 'sale/trans >28.5g marijuana or >4g',
 'sale/transport marijuana to a minor',
 'sale/transport marijuana w/priors',
 'sell/furnish/etc marij/hash over 1 oz',
 'smoke marij loc where smoke tobacco proh',
 'smoke/ingest marijuana in public place',
 'transport/etc marijuana',
 'transport/sell/furnish/etc marijuana',
 'unlawfully plant/cultivate/etc marijuana',
 'use minor under 14 trans/sell marijuana']

Feature Engineering

Creating a target variable, which will be based on whether or not the charge was related to marijuana sale.

In [17]:
# Using map function to identify marijuana sale
target = df_arrests['Charge Description'].map(lambda x: 'sell' if 'sale' in x else x)
In [19]:
# Identifying sale or not
target = target.map(lambda x: 'sell' if 'sell' in x else x)
target = target.map(lambda x: 'not sell' if 'sell' not in x else x)
In [20]:
# Counting the number of examples in each class
target.value_counts(normalize = True)
Out[20]:
sell        0.616929
not sell    0.383071
Name: Charge Description, dtype: float64
In [21]:
# Adding the variabe on the dataframe
df_arrests['target'] = target
In [22]:
# Encoding the variable
df_arrests['target'] = df_arrests['target'].map(lambda x: 1 if x == 'sell' else 0)
In [28]:
df_arrests.head()
Out[28]:
Report ID Arrest Date Time Area ID Area Name Reporting District Age Sex Code Descent Code Charge Group Code ... Cross Street Location latitude longitude marijuana_related store_0.5_mile store_1_mile school_0.5_mile school_1_mile target
0 5568617 2019-03-09 2015.0 6 Hollywood 646 29 M O 16.0 ... HOLLYWOOD ['34.1016', '-118.335'] 34.1016 -118.3350 possession marijuana for sale 7 8 2 5 1
1 5568629 2019-03-09 2015.0 6 Hollywood 646 25 M B 16.0 ... HOLLYWOOD ['34.1016', '-118.335'] 34.1016 -118.3350 possession marijuana for sale 7 8 2 5 1
2 5607178 2019-04-21 2115.0 7 Wilshire 702 42 M B 16.0 ... FAIRFAX AV ['34.0838', '-118.3614'] 34.0838 -118.3614 poss/sale marij ovr 21 employ per 20/belw 4 13 4 8 1
3 5616516 2019-05-01 1045.0 14 Pacific 1494 26 M B 16.0 ... NaN ['33.9428', '-118.4054'] 33.9428 -118.4054 sale/trans >28.5g marijuana or >4g 0 0 0 0 1
4 5616188 2019-05-01 5.0 15 N Hollywood 1543 27 M B 16.0 ... WEDDINGTON ['34.1667', '-118.3965'] 34.1667 -118.3965 sale/trans >28.5g marijuana or >4g 0 1 3 4 1

5 rows × 25 columns

In [29]:
# Checking correlation between the two dispensary distance variables
df_arrests[['store_0.5_mile', 'store_1_mile']].corr()
Out[29]:
store_0.5_mile store_1_mile
store_0.5_mile 1.000000 0.834965
store_1_mile 0.834965 1.000000
In [30]:
# Checking correlation between the two school distance variables
df_arrests[['school_0.5_mile', 'school_1_mile']].corr()
Out[30]:
school_0.5_mile school_1_mile
school_0.5_mile 1.000000 0.779701
school_1_mile 0.779701 1.000000

The variables are highly correlated, so one of them will need to be dropped. I'll also drop some other variables which shouldn't be part of the model, such as the 'Charge Description' which was used to generate the target, as well as the 'Arrest Date', since the goal will be predicting how things might have played out with a law change. I'll keep the latitude and longitude variables and remove most other variables which refer to location.

In [31]:
df_arrests.columns
Out[31]:
Index(['Report ID', 'Arrest Date', 'Time', 'Area ID', 'Area Name',
       'Reporting District', 'Age', 'Sex Code', 'Descent Code',
       'Charge Group Code', 'Charge Group Description', 'Arrest Type Code',
       'Charge', 'Charge Description', 'Address', 'Cross Street', 'Location',
       'latitude', 'longitude', 'marijuana_related', 'store_0.5_mile',
       'store_1_mile', 'school_0.5_mile', 'school_1_mile', 'target'],
      dtype='object')
In [38]:
X = df_arrests.drop(['Report ID', 'Arrest Date', 'Area ID', 'Reporting District',
                     'Charge Group Code', 'Charge Group Description', 'Charge',
                     'Charge Description', 'Address', 'Cross Street', 'Location',
                     'marijuana_related', 'store_0.5_mile', 'school_0.5_mile',
                     'target'], axis = 1)
In [39]:
x
Out[39]:
Time Area Name Age Sex Code Descent Code Arrest Type Code latitude longitude store_1_mile school_1_mile
0 2015.0 Hollywood 29 M O F 34.1016 -118.3350 8 5
1 2015.0 Hollywood 25 M B F 34.1016 -118.3350 8 5
2 2115.0 Wilshire 42 M B F 34.0838 -118.3614 13 8
3 1045.0 Pacific 26 M B F 33.9428 -118.4054 0 0
4 5.0 N Hollywood 27 M B F 34.1667 -118.3965 1 4
... ... ... ... ... ... ... ... ... ... ...
6221 1020.0 77th Street 57 M B M 33.9922 -118.3276 4 11
6222 1020.0 77th Street 57 M B M 33.9922 -118.3276 4 11
6223 1531.0 Pacific 36 M O F 33.9432 -118.4013 0 0
6224 1130.0 Pacific 22 M B F 33.9420 -118.4095 0 0
6225 2140.0 Central 19 M H M 34.0382 -118.2603 17 10

6226 rows × 10 columns

In [40]:
# Creating dummy variables for the categorical variables
X = pd.get_dummies(X)
In [41]:
X.head()
Out[41]:
Time Age latitude longitude store_1_mile school_1_mile Area Name_77th Street Area Name_Central Area Name_Devonshire Area Name_Foothill ... Descent Code_F Descent Code_G Descent Code_H Descent Code_O Descent Code_P Descent Code_W Descent Code_X Arrest Type Code_F Arrest Type Code_I Arrest Type Code_M
0 2015.0 29 34.1016 -118.3350 8 5 0 0 0 0 ... 0 0 0 1 0 0 0 1 0 0
1 2015.0 25 34.1016 -118.3350 8 5 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
2 2115.0 42 34.0838 -118.3614 13 8 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
3 1045.0 26 33.9428 -118.4054 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
4 5.0 27 34.1667 -118.3965 1 4 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0

5 rows × 42 columns

In [42]:
# Target (y)
Y = df_arrests['target']
In [43]:
Y.head()
Out[43]:
0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64
In [46]:
# Train test split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, stratify = Y, test_size = 0.25)
In [47]:
# Create model
model_01 = LogisticRegression()
In [48]:
model_01.fit(X_train, Y_train)
Out[48]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
In [51]:
# Evaluate accuracy 
acc_train = model_01.score(X_train, Y_train)
print(f'Train accuracy: {round(acc_train, 2) * 100}')

acc_test = model_01.score(X_test, Y_test)
print(f'Test accuracy: {round(acc_test, 2) * 100}')
Train accuracy: 89.0
Test accuracy: 88.0

The accuracy is quite good

In [53]:
# Extracting the coeficients
beta = model_01.coef_[0]
In [54]:
model_01.coef_[0]
Out[54]:
array([ 5.42813279e-04, -4.39057310e-02, -5.87499278e-01, -1.63918853e-01,
       -4.69835427e-03,  2.51708653e-02,  1.00283280e-01,  2.55536941e-01,
       -2.69077267e-01, -3.00366987e-01,  2.80027493e-01, -1.54515871e-01,
        2.32027004e-01, -2.88713383e-01, -2.08952561e-01, -9.62976074e-02,
       -1.19109643e-01, -2.58349585e-02,  8.01728302e-01, -2.44371498e-01,
        1.42997356e+00,  2.63020935e-02, -1.81512868e-01, -8.49705021e-01,
        9.22610897e-03, -3.00631283e-01, -9.20285196e-02,  5.70914016e-01,
       -5.66926706e-01, -7.25988136e-02,  9.81864938e-01,  4.96319703e-03,
        9.15791505e-03,  0.00000000e+00,  2.28613040e-02, -4.80238751e-01,
        0.00000000e+00, -4.62022479e-01,  0.00000000e+00,  3.99179124e+00,
       -3.03612676e+00, -9.51677171e-01])
In [55]:
# Attributes
attributes = list(X.columns)
In [56]:
# Creating a dataframe matching attributes to their coefficients
beta_df = pd.DataFrame({'attributes': attributes, 'beta': beta})
beta_df.set_index('attributes', inplace = True)
In [58]:
beta_df.head()
Out[58]:
beta
attributes
Time 0.000543
Age -0.043906
latitude -0.587499
longitude -0.163919
store_1_mile -0.004698
In [59]:
# Extracting the log of the odds
beta_df['log_odds'] = beta_df['beta'].map(lambda x: np.exp(x))
In [60]:
# Descending order
beta_df = beta_df.sort_values('log_odds', ascending = False)
In [61]:
beta_df.head(10)
Out[61]:
beta log_odds
attributes
Arrest Type Code_F 3.991791 54.151801
Area Name_Southeast 1.429974 4.178589
Descent Code_B 0.981865 2.669430
Area Name_Pacific 0.801728 2.229391
Sex Code_F 0.570914 1.769884
Area Name_Harbor 0.280027 1.323166
Area Name_Central 0.255537 1.291155
Area Name_Hollywood 0.232027 1.261154
Area Name_77th Street 0.100283 1.105484
Area Name_Southwest 0.026302 1.026651
In [62]:
# Checking if the coeficients for the distance variables: school
beta_df[beta_df.index == 'school_1_mile']
Out[62]:
beta log_odds
attributes
school_1_mile 0.025171 1.02549
In [63]:
# Checking if the coeficients for the distance variables: dispensary
beta_df[beta_df.index == 'store_1_mile']
Out[63]:
beta log_odds
attributes
store_1_mile -0.004698 0.995313
In [65]:
# Extracting the probabilities to make a 'confidence plot'
prob_arrest_sale = model_01.predict_proba(X_test)[:,1]
In [66]:
prob_arrest_sale
Out[66]:
array([0.98259978, 0.13338748, 0.02145263, ..., 0.22214757, 0.03333186,
       0.95810141])
In [68]:
# Create dataframe with actual values versus predicted probabilities (1 = sale arrest)
pred_df = pd.DataFrame({'true_values': Y_test, 'predicted_values':prob_arrest_sale})
In [69]:
pred_df
Out[69]:
true_values predicted_values
2332 1 0.982600
922 0 0.133387
2048 0 0.021453
3948 1 0.974726
1371 0 0.010254
... ... ...
2299 1 0.935365
4812 1 0.978600
744 0 0.222148
1261 0 0.033332
4980 1 0.958101

1557 rows × 2 columns

In [72]:
# Plotting the probability distribuition

# Figure
plt.figure(figsize = (18,10))

# Histogram of real values x predicted values for class 1
plt.hist(pred_df[pred_df['true_values'] == 1]['predicted_values'], 
         bins = 30, 
         color = 'r', 
         alpha = 0.6, 
         label = 'Marijuana Sale Arrest')

# Histogram of real values x predicted values for class 0
plt.hist(pred_df[pred_df['true_values'] == 0]['predicted_values'], 
         bins = 30, 
         color = 'blue', 
         alpha = 0.6, 
         label = 'Non-Marijuana Sale Arrest')

# Center line to split the plot
plt.vlines(x = 0.5, ymin = 0, ymax = 200, color = 'green', linestyle = '--')

# Labels and Legend
plt.title('\nProbability Distribution\n', fontsize = 30)
plt.ylabel('\nFrequency\n', fontsize = 30)
plt.xlabel('\nPredicted Probability of Arrest for Marijuana Sale\n', fontsize = 24)
plt.legend(fontsize = 24);

Arrest Probability x Nearby Police Department

Does having a police department nearby alters the result?

In [73]:
# Extracting the coeficients per area name
beta_df_area = beta_df[beta_df.index.str.contains('Area Name')].head(8)
In [77]:
beta_df_area.head()
Out[77]:
beta log_odds
attributes
Area Name_Southeast 1.429974 4.178589
Area Name_Pacific 0.801728 2.229391
Area Name_Harbor 0.280027 1.323166
Area Name_Central 0.255537 1.291155
Area Name_Hollywood 0.232027 1.261154

The Los Angeles departments are registered here:

http://www.lapdonline.org/our_communities/content_basic_view/6279

Manually extacting their geolocation coordinates:

In [78]:
# Create dataframe for geolocation data of LAPD
LAPD = pd.DataFrame()
In [79]:
LAPD['latitude'] = [33.9383761,
                       34.097986,
                       33.9920067,
                       34.0443028,
                       33.7584097,
                       34.050264,
                       34.1195162,
                       34.1842023]
In [80]:
LAPD['longitude'] = [-118.2749244,
                        -118.331013,
                        -118.4199295,
                        -118.4509833,
                        -118.2880336,
                        -118.291531,
                        -118.2497385,
                        -118.3021552]
In [81]:
# Plot for class 1 with main attributes versus marijuana arrests

# Figure
fig, ax = plt.subplots(figsize = (18,10))

# Plot
plt.scatter(df_arrests[df_arrests['target'] == 1]['latitude'], 
            df_arrests[df_arrests['target'] == 1]['longitude'], 
            s = 15, 
            alpha = 0.4,
            label = 'Arrests for Marijuana Sale')

# Títulos, labels e legenda
ax.scatter(LAPD['latitude'], LAPD['longitude'], color = 'red', label = 'Los Angeles Police Departments', s = 100)
plt.title('\nArrests for Marijuana Sale x LAPD Offices\n', fontsize = 24)
plt.xlabel('\nLatitude\n', fontsize = 20)
plt.ylabel('\nLongitude\n', fontsize = 20)
ax.legend(fontsize = 20)
plt.tight_layout()
In [82]:
# Plot for class 1 with main attributes versus marijuana arrests

# Figure
fig, ax = plt.subplots(figsize = (18,10))

# Plot
plt.scatter(df_arrests[df_arrests['target'] == 1]['latitude'], 
            df_arrests[df_arrests['target'] == 1]['longitude'], 
            s = 15, 
            alpha = 0.4,
            label = 'Arrests for Other Reasons')

# Títulos, labels e legenda
ax.scatter(LAPD['latitude'], LAPD['longitude'], color = 'red', label = 'Los Angeles Police Departments', s = 100)
plt.title('\nArrests for Other Reasons x LAPD Offices\n', fontsize = 24)
plt.xlabel('\nLatitude\n', fontsize = 20)
plt.ylabel('\nLongitude\n', fontsize = 20)
ax.legend(fontsize = 20)
plt.tight_layout()

Conclusion

The fear of exposing children to marijuana doesn't seem to be a factor. There are two classes: selling and not selling marijuana, and the argument was that marijuana sales expose more children to it. The variable school_1_mile é the count of schools which are whithin a 1 mile range of each arrest. Given the log chances of school_1_mile being roughly equal to one, having a school in a 1 mile range from an arrest is as likely as the arrest being related to the sale as to other causes.