In terms of police reporting, Orlando is a large metropolis that likes to pretend it’s still a small town. What I mean by this is that the Orlando Police Department files and stores police dispatches on everything that officers are called on for (except minor traffic stops). This means that Orlando often ranks disproportionately high in crime lists that are based on the number of reports per capita. It also means that we have plenty of data to look through.
Rather than choosing a default set, I asked if anyone in Orlando had a public dataset that they wanted analyzed. Someone in our Code for Orlando brigade sent me a CSV of around 1.5 million Orlando Police Dept. dispatches.
Before importing the dataset into R, I wanted to split the datetime column into its elements and add the header line to it. I ended up using this Python code in a terminal.
lines = []
newlines = []
with open('opddata.csv', 'r') as fin:
    lines = fin.readlines()

# Split datetime into columns
for line in lines:
    line = line.split(',')
    newline = line[0].strip('"')
    for item in ['-', ' ', ':']:
        newline = newline.replace(item, ',')
    # Keep the original datetime, then its parts, then the remaining fields
    newline = line[0] + ',' + newline + ',' + ','.join(line[1:])
    newlines.append(newline)

# Add header line
header = 'datetime,year,month,day,hour,minute,second,lat,lon,reason,agency\n'
with open('opddatasplit.csv', 'w') as fout:
    fout.write(header + ''.join(newlines))
Now we can set up our workspace and load it into R.
## datetime year month
## 2010-09-15 13:01:00: 14 Min. :2009 Min. : 1.000
## 2011-09-17 00:23:00: 12 1st Qu.:2011 1st Qu.: 4.000
## 2011-09-21 00:21:00: 12 Median :2013 Median : 7.000
## 2014-10-22 11:02:00: 12 Mean :2012 Mean : 6.522
## 2014-11-25 11:54:00: 12 3rd Qu.:2014 3rd Qu.: 9.000
## 2010-03-22 21:40:00: 10 Max. :2015 Max. :12.000
## (Other) :1448486
## day hour minute second
## Min. : 1.00 Min. : 0.00 Min. : 0.00 Min. : 0.0000
## 1st Qu.: 8.00 1st Qu.: 8.00 1st Qu.:15.00 1st Qu.: 0.0000
## Median :16.00 Median :14.00 Median :30.00 Median : 0.0000
## Mean :15.81 Mean :12.96 Mean :29.52 Mean : 0.6054
## 3rd Qu.:23.00 3rd Qu.:18.00 3rd Qu.:45.00 3rd Qu.: 0.0000
## Max. :31.00 Max. :23.00 Max. :59.00 Max. :59.0000
##
## lat lon reason
## Min. :-34.53 Min. :-88.0324 general disturbance:135448
## 1st Qu.: 28.50 1st Qu.:-81.4359 accident :120913
## Median : 28.53 Median :-81.3878 suspicious person :109178
## Mean : 28.52 Mean :-81.3889 battery : 69508
## 3rd Qu.: 28.55 3rd Qu.:-81.3481 unknown trouble : 69184
## Max. : 50.83 Max. : -0.2423 commercial alarm : 67645
## (Other) :876682
## agency
## ocso: 29624
## opd :1418934
##
##
##
##
##
## datetime year month day hour minute second lat lon
## 1 2009-05-09 12:37:00 2009 5 9 12 37 0 28.54386 -81.39834
## reason agency
## 1 battery opd
## datetime year month day hour minute second lat
## 1448558 2015-08-21 20:46:25 2015 8 21 20 46 25 28.53131
## lon reason agency
## 1448558 -81.14496 house/bus./area/check ocso
Yes, this dataset has 1.45 million rows of police dispatches dating from 2009-05-09 to 2015-08-21. Looking at the datetime items, we can make some initial observations and conjectures.
Let’s start by looking at the times when the incidents are reported. We’ll look at year, month, day, and hour; there’s nothing valuable we can gain from minute and second.
Most years, the bin count is pretty stable just over 200K. We have incomplete data for 2009 and 2015. However, there’s a sizable spike in dispatches in 2014. That’s something to investigate later.
We can see there is, in fact, an increase in dispatches during the summer months that drops back down to normal in September. This is likely due to having no data before May 2009 or after August 2015. Even so, there’s a noticeable drop during December matched only by February, which is usually three days shorter, and we only have four years of data for each. I’d like to see that separated out by year.
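The per-year breakdown can also be sketched in Python. This is a minimal sketch, not the R code behind the plot. It assumes rows laid out like opddatasplit.csv, and the function name `count_by_year_month` and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter

def count_by_year_month(rows):
    """Count dispatches per (year, month) from rows keyed by column name."""
    counts = Counter()
    for row in rows:
        counts[(int(row['year']), int(row['month']))] += 1
    return counts

# Tiny made-up sample in the same layout as opddatasplit.csv
sample = io.StringIO(
    'datetime,year,month,day,hour,minute,second,lat,lon,reason,agency\n'
    '"2014-04-01 08:00:00",2014,04,01,08,00,00,28.54,-81.38,accident,opd\n'
    '"2014-04-02 09:30:00",2014,04,02,09,30,00,28.55,-81.37,battery,opd\n'
    '"2013-12-25 10:15:00",2013,12,25,10,15,00,28.53,-81.39,theft,opd\n'
)
monthly = count_by_year_month(csv.DictReader(sample))
for (year, month), n in sorted(monthly.items()):
    print(year, month, n)
```

Keying the counter on the (year, month) pair keeps a month in one year from being pooled with the same month in another, which is what hides a single unusual year in the combined monthly plot.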
It seems we’ve found why there was an up-tick in the summer and in 2014: there were about twice as many dispatches as normal in 2014 from April to November. There’s also a spike in August 2015, a month for which we only have about two-thirds of the expected data. Was crime rampant during these months? What I think is more likely is that there is a new ‘reason’ that caused the spike or a police policy that led to officers responding to more incidents.
Turns out the number of daily dispatches is fairly steady with the median staying around 625 and the range of the middle 50% of values staying around 125. The outliers also seem to form somewhat distinct bands. I want to look at this again later.
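For reference, the two numbers read off the boxplot (the median and the spread of the middle 50%) can be computed directly. The sketch below uses Python's standard `statistics` module. The daily totals are made up, and the 'inclusive' quartile method is an assumption that may differ slightly from how R draws its box hinges.

```python
import statistics

def median_and_iqr(daily_counts):
    """Return (median, interquartile range) for a list of daily totals."""
    q1, q2, q3 = statistics.quantiles(daily_counts, n=4, method='inclusive')
    return q2, q3 - q1

# Made-up daily totals, only to show the call
sample_days = [560, 600, 610, 625, 640, 690, 700]
median, iqr = median_and_iqr(sample_days)
print(median, iqr)
```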
Here’s a clear view of the hourly dispatches. We can see that the graph mostly follows a parabolic arc starting at 5 AM and peaking in the early evening. The spike at 6 PM is likely due to rush hour accidents getting reported. I’m curious why there’s a drop just before it, though.
Now I want to look more at the ‘reason’ column. We have 153 of them, and I’d like to classify them into a few larger categories.
Also, a quick note. These are the reasons the police officer was called to the scene. While I will look at this data and make assumptions about the actual outcome, not all of these dispatches likely match one-to-one with the actual events.
## [1] "911 emergency"
## [2] "911 hang up"
## [3] "911 non-emergency"
## [4] "abandoned boat"
## [5] "abandoned vehicle"
## [6] "accident"
## [7] "aggravated assault"
## [8] "aggravated battery"
## [9] "airplane accident"
## [10] "ambulance escort"
## [11] "animal calls"
## [12] "armed robbery"
## [13] "arson fire"
## [14] "assist fire dept."
## [15] "attempted rape"
## [16] "attempted suicide"
## [17] "bad check passed"
## [18] "bank alarm"
## [19] "bank robbery"
## [20] "battery"
## [21] "batt. on law enf. off."
## [22] "bike patrol"
## [23] "bomb explosion"
## [24] "bomb threat"
## [25] "bribery"
## [26] "burglary business"
## [27] "burglary hotel"
## [28] "burglary residence"
## [29] "burglary vehicle"
## [30] "carjacking"
## [31] "check well being"
## [32] "child abuse"
## [33] "child neglect"
## [34] "citizen assist"
## [35] "commercial alarm"
## [36] "commercial b&e"
## [37] "commercial robbery"
## [38] "community orientated policing detail"
## [39] "county ord. viol."
## [40] "criminal mischief"
## [41] "dead animal"
## [42] "dead person"
## [43] "designated patrol area"
## [44] "deviant sexual activities"
## [45] "direct traffic"
## [46] "disabled occupied vehicle"
## [47] "discharge weapon"
## [48] "domestic disturbance"
## [49] "door alarm"
## [50] "d.p.a. available"
## [51] "drowning"
## [52] "drug violation"
## [53] "drunk driver"
## [54] "drunk pedestrian"
## [55] "drunk person"
## [56] "escaped prisoner"
## [57] "false imprisonment"
## [58] "felony"
## [59] "felony drugs"
## [60] "fire"
## [61] "fishing violation"
## [62] "forgery"
## [63] "found property"
## [64] "fraud/counterfeit"
## [65] "fugitive from justice"
## [66] "gambling"
## [67] "general disturbance"
## [68] "general investigation"
## [69] "grand theft"
## [70] "hit and run"
## [71] "hitchhiker"
## [72] "hold-up alarm"
## [73] "home invasion"
## [74] "house/bus./area/check"
## [75] "house/business check"
## [76] "illegal fishing"
## [77] "illegally parked cars"
## [78] "impersonating police officer"
## [79] "industrial accident"
## [80] "k-9 requested"
## [81] "kidnapping"
## [82] "law enforcement officer escort"
## [83] "leo escort"
## [84] "liquor law violation"
## [85] "lost/found property"
## [86] "man down"
## [87] "mentally-ill person"
## [88] "misd. drugs"
## [89] "misdemeanor"
## [90] "missing person"
## [91] "missing person recovered"
## [92] "murder"
## [93] "mutual aid"
## [94] "near drowning"
## [95] "noise ordinance violation"
## [96] "non-emergency assistance"
## [97] "non-so warrant"
## [98] "nuisance animal"
## [99] "obscene/harassing phone calls"
## [100] "obstruction on highway"
## [101] "obstruct on hwy"
## [102] "officer with prisoner"
## [103] "open door/window"
## [104] "other sex crimes"
## [105] "parking violation"
## [106] "person robbery"
## [107] "petit theft"
## [108] "physical fight"
## [109] "prostitution"
## [110] "prowler"
## [111] "rape"
## [112] "reckless boat"
## [113] "reckless driver"
## [114] "reckless vehicle"
## [115] "rescue-medical only"
## [116] "residential alarm"
## [117] "residential b&e"
## [118] "resist w/o violence"
## [119] "school zone crossing"
## [120] "security checkpoint alarm"
## [121] "shoplifting"
## [122] "sick or injured person"
## [123] "signal out"
## [124] "solicitor"
## [125] "stalking"
## [126] "standby"
## [127] "stolen/lost tag"
## [128] "stolen/lost tag recovered"
## [129] "stolen vehicle"
## [130] "stolen vehicle recovered"
## [131] "strong arm robbery"
## [132] "suicide"
## [133] "suspicious boat"
## [134] "suspicious car/occupant armed"
## [135] "suspicious hazard"
## [136] "suspicious incident"
## [137] "suspicious luggage"
## [138] "suspicious person"
## [139] "suspicious vehicle"
## [140] "suspicious video"
## [141] "theft"
## [142] "threatening animal"
## [143] "threats/assaults"
## [144] "traffic light"
## [145] "traffic (misc)"
## [146] "trash dumping"
## [147] "trespasser"
## [148] "unknown trouble"
## [149] "vandalism/criminal mischief"
## [150] "vehicle accident"
## [151] "vehicle alarm"
## [152] "verbal disturbance"
## [153] "weapons/armed"
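Given these levels, I think the best categories will be violent crime (‘violent’), non-violent crime (‘nonviolent’), transportation (‘transport’), on-call responses (‘oncall’), and everything left over (‘other’).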
The items put into each category in the code below are at my discretion. However, I used the definition of violent crime from the Bureau of Justice Statistics as my guide for the first two lists.
Violent crime involves intentional or intended physical harm to another human including murder, rape and sexual assault, robbery, and assault.
Many police departments also include attempted violent crime as violent crime as well as crimes like arson where bodily harm is possible. This is why robbery (victims present) is a violent crime while burglary (victims not present) is not. I’ll also state that, for the purpose of these lists, ‘crime’ is breaking federal or state laws, not county ordinances, so reasons that include ‘violation’, which mostly apply to local ordinances, will be put in the ‘oncall’ list.
violent_list = c('aggravated assault','aggravated battery','armed robbery','arson fire','attempted rape','bank robbery','battery','batt. on law enf. off.','bomb explosion','bomb threat','carjacking','child abuse','child neglect','commercial robbery','drunk driver','false imprisonment','hit and run','hold-up alarm','home invasion','kidnapping','murder','other sex crimes','person robbery','rape','strong arm robbery','threats/assaults','weapons/armed')
nonviolent_list = c('bad check passed','bribery','burglary business','burglary hotel','burglary residence','commercial b&e','criminal mischief','drug violation','drunk pedestrian','drunk person','escaped prisoner','felony','felony drugs','forgery','fraud/counterfeit','fugitive from justice','gambling','grand theft','illegal fishing','impersonating police officer','misd. drugs','misdemeanor','petit theft','prostitution','residential b&e','resist w/o violence','shoplifting','theft','vandalism/criminal mischief')
transport_list = c('abandoned boat','abandoned vehicle','accident','airplane accident','burglary vehicle','disabled occupied vehicle','illegally parked cars','obstruction on highway','obstruct on hwy','parking violation','reckless boat','reckless driver','reckless vehicle','signal out','stolen/lost tag','stolen/lost tag recovered','stolen vehicle','stolen vehicle recovered','suspicious boat','suspicious car/occupant armed','suspicious vehicle','traffic light','traffic (misc)','vehicle accident','vehicle alarm')
oncall_list = c('911 emergency','911 hang up','animal calls','attempted suicide','bank alarm','check well being','commercial alarm','county ord. viol.','dead animal','dead person','deviant sexual activities','discharge weapon','domestic disturbance','door alarm','drowning','fire','fishing violation','found property','general disturbance','general investigation','hitchhiker','house/bus./area/check','house/business check','industrial accident','liquor law violation','lost/found property','mentally-ill person','missing person','missing person recovered','near drowning','noise ordinance violation','non-emergency assistance','non-so warrant','nuisance animal','obscene/harassing phone calls','open door/window','physical fight','prowler','rescue-medical only','residential alarm','security checkpoint alarm','sick or injured person','solicitor','stalking','suicide','suspicious hazard','suspicious incident','suspicious luggage','suspicious person','suspicious video','threatening animal','trash dumping','trespasser','unknown trouble','verbal disturbance')
Now that we have our lists, let’s make a new column called ‘reason_cat’ that tells us which category each dispatch belongs to and take a quick look at the distribution of our reason categories.
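The lookup behind a column like ‘reason_cat’ can be sketched in Python as a dictionary inverted from the category lists. This is an illustration only: the lists here are shortened, the names `CATEGORY_LISTS`, `REASON_TO_CAT`, and `reason_category` are hypothetical, and treating every unlisted reason as ‘other’ is an assumption.

```python
# Shortened category lists; the full lists are the four R vectors above
CATEGORY_LISTS = {
    'violent': ['battery', 'murder', 'armed robbery'],
    'nonviolent': ['theft', 'shoplifting', 'forgery'],
    'transport': ['accident', 'stolen vehicle', 'signal out'],
    'oncall': ['general disturbance', 'suspicious person', 'trespasser'],
}

# Invert to a reason -> category lookup table
REASON_TO_CAT = {
    reason: cat
    for cat, reasons in CATEGORY_LISTS.items()
    for reason in reasons
}

def reason_category(reason):
    """Return the category for a dispatch reason, or 'other' if unlisted."""
    return REASON_TO_CAT.get(reason, 'other')

print(reason_category('battery'))
print(reason_category('bike patrol'))
```

Inverting the lists once means each dispatch needs a single dictionary lookup instead of a scan through four lists, which matters at 1.45 million rows.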
Over half of the dispatches fall into the ‘oncall’ category, which makes sense. Police are often called upon to make official reports of an incident or act as a government liaison for certain events. That category also has the most individual reasons. I’d like to see the most frequent items in these categories.
## battery threats/assaults hit and run person robbery
## 69508 30264 12272 6365
## hold-up alarm other sex crimes child neglect rape
## 4690 3874 3273 1955
## child abuse drunk driver
## 1721 1122
Of our violent crimes, half of them are for battery. In this category, 97% of our dispatches fall into the top 10 of the 27 reasons. Also, there are only 12 murder dispatches. This seems uncharacteristically low for a span of six years. It’s possible that police respond to certain calls that end up as a murder incident rather than responding after the murder has already happened.
## theft residential b&e
## 40975 37225
## shoplifting vandalism/criminal mischief
## 25822 17320
## fugitive from justice drug violation
## 15763 13931
## commercial b&e fraud/counterfeit
## 7474 6704
## drunk pedestrian burglary residence
## 3802 680
Similarly, 97% of non-violent dispatches are also made up of the top 10 of 29.
## accident suspicious vehicle
## 120913 27244
## burglary vehicle stolen vehicle
## 26980 17872
## disabled occupied vehicle obstruction on highway
## 13159 11506
## illegally parked cars abandoned vehicle
## 9387 4812
## signal out stolen vehicle recovered
## 3445 3428
Accidents make up half of our transport dispatches and are the second most common reason making up 8.3% of our dataset. Again, 97% of this category is made up of the top 10 of 25.
## general disturbance suspicious person
## 135448 109178
## unknown trouble commercial alarm
## 69184 67645
## trespasser suspicious incident
## 54349 40917
## residential alarm house/business check
## 40280 39522
## domestic disturbance noise ordinance violation
## 35888 26418
Now to our largest group. General disturbances are the most numerous reason, making up 17.2% of this category and 9.4% of our dataset. We also have ‘unknown trouble’ for 4.8% of our dataset. This category is a little more spread out, with the top 10 of its 55 reasons making up only 78.6% of its dispatches.
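The dataset-wide percentages quoted above can be checked from the printed counts. The short Python sketch below uses the row count from the summary output (1,448,558) and the top 10 ‘oncall’ counts exactly as printed. The helper name `share_of_dataset` is made up for this example.

```python
TOTAL_DISPATCHES = 1448558  # row count from the summary output above

def share_of_dataset(count, total=TOTAL_DISPATCHES):
    """Percentage of all dispatches, rounded to one decimal place."""
    return round(100 * count / total, 1)

# Top 10 'oncall' counts as printed above
oncall_top10 = [135448, 109178, 69184, 67645, 54349,
                40917, 40280, 39522, 35888, 26418]
oncall_top10_total = sum(oncall_top10)

print(share_of_dataset(135448))  # general disturbance
print(share_of_dataset(69184))   # unknown trouble
print(share_of_dataset(120913))  # accident
print(oncall_top10_total)
```

Dividing the top 10 total by 0.786 suggests roughly 787,000 ‘oncall’ dispatches, or about 54% of the dataset. That is consistent with the earlier observation that over half of the dispatches fall into this category, though the figure is only approximate because it is derived from a rounded percentage.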
Armed with this new column, let’s take another look at our hourly graph. This time, we’ll divide each bar by category.
For the most part, each category rises and falls with the overall arc of the day as we saw before. I have an idea that might explain what we see at 5 PM and 6 PM.
The time associated with a police report is not when the incident actually happened; it’s when the report is filed, i.e., when the officer arrives at that location. The heaviest rush hour traffic starts around 5 PM when most people leave work. I believe that many of the 6 PM dispatches happened in the 5 PM block, but the traffic kept enough officers from getting to the site promptly. If you average both bars in the graph, they fit the arc we would expect to see.
Also, there are increases in the height of ‘other’ at 7 AM and 2 PM. Because of the timing and because ‘other’ mostly consists of non-incident police activities, I believe these increases are due to the public school day beginning and ending at those times.
Let’s see if that spike in August 2015 is related to the categories.
There it is. There was a dramatic increase in the ‘oncall’ category. However, there are also smaller increases in every other category. Given that our data for August is only about two-thirds complete, there was definitely either an increase in overall police activity or a policy change that led to more police dispatches.
What about that increase in 2014?
These columns look very similar to the one in August 2015. It could be that they share the same cause. However, I believe there is something else going on here. The increase isn’t strictly during the Summer; it starts in April and goes through November. Rather than either/or, I believe there was both a policy change and an increase in law enforcement presence. Why? The increase in reporting matches up to the election season. While the president wasn’t on the ballot, the state governor was. However, we don’t see a similar increase in 2012.
Because each row comes with a datetime string, we can use R to determine on which day of the week it was filed.
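The same conversion can be sketched in Python with the standard `datetime` module. This is an illustration, not the R call used for the analysis, and the names `DAY_NAMES` and `day_of_week` are made up for this example. The two inputs are the first and last datetimes from the output shown earlier.

```python
from datetime import datetime

DAY_NAMES = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
             'Friday', 'Saturday', 'Sunday']

def day_of_week(stamp):
    """Return the weekday name for a 'YYYY-MM-DD HH:MM:SS' string."""
    parsed = datetime.strptime(stamp, '%Y-%m-%d %H:%M:%S')
    return DAY_NAMES[parsed.weekday()]  # weekday(): Monday is 0

first_day = day_of_week('2009-05-09 12:37:00')
last_day = day_of_week('2015-08-21 20:46:25')
print(first_day, last_day)
```

Indexing into a fixed list of names avoids depending on the machine's locale, which `strftime('%A')` would.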
Let’s take another look at those categories by day of the week.
That’s flatter than I thought it would be. There are slightly fewer dispatches on Sundays, but not by much. Maybe we’ll see something if it’s faceted (we’ll exclude ‘oncall’ from this).
Violent crime stays steady throughout the week, while three of the five categories see drops over the weekend. This likely has to do with officer prioritization. A department only has so many officers to send places, especially on weekends when some officers have a day off. Violent crimes take precedence, so they see relatively little fluctuation. The other categories are responded to based on the officers who are left. However, ‘oncall’ actually increased on Friday and Saturday. It’s likely that some of the reasons in the category are not as time-dependent, so they are pushed to the weekend.
Let’s revisit that day boxplot, but we’ll use points and color by day of the week this time.
There doesn’t seem to be a connection between day of the week and the number of dispatches per day, but we can see why the bands of outliers exist in our boxplot. The number of dispatches is mostly consistent within each year except for 2014. When included with the other years, almost all of 2009 is considered an outlier. In 2014, there are two distinct bands, which likely have to do with the jumps in numbers from April to November. What I’m shocked to see is just how abrupt the changes are per year. For example, there are only a couple of days in 2010 that even fall into the range of 2009. This makes me think that a new police policy took effect at the very start of 2010 that had an immediate impact on the number of daily dispatches.
Let’s try to make some heat maps from our geo data. We know there are some outliers in the coordinates, so let’s figure out a better bounding box. I know from an older project the approximate bounds of Orange County, FL. Let’s start there with a decent buffer zone.
Now refining those values, we’ll use (28.34,-81.6),(28.64,-81.2) as our bounding box. Let’s create a subset of our data so we can round our coordinates. We’ll also create a function that will turn the dataframe into a frequency table we can use in the visualizations. However, there’s a caveat in the data. There’s a high number of reports located at or in the immediate vicinity of the police station and the county courthouse, which causes the rest of the data points to be washed out. For the purpose of making these plots, we will also omit these two locations. We’ll do this by supplying a frequency cap to our function.
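The steps just described (filter to the bounding box, round the coordinates, count per location, drop anything over a cap) can be sketched in Python as follows. This is not the R function used for the plots. The name `frequency_table`, the sample points, and rounding to three decimal places are assumptions made for illustration.

```python
from collections import Counter

# Bounding box from the text: (28.34, -81.6) to (28.64, -81.2)
LAT_MIN, LAT_MAX = 28.34, 28.64
LON_MIN, LON_MAX = -81.6, -81.2

def frequency_table(points, digits=3, cap=None):
    """Count points per rounded (lat, lon) cell inside the bounding box.

    Cells whose count exceeds `cap` are dropped, so that a few extreme
    locations do not wash out the rest of the map.
    """
    counts = Counter(
        (round(lat, digits), round(lon, digits))
        for lat, lon in points
        if LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX
    )
    if cap is not None:
        counts = Counter({cell: n for cell, n in counts.items() if n <= cap})
    return counts

# Made-up points: three at one spot, one elsewhere, one outside the box
points = [(28.5384, -81.3789)] * 3 + [(28.4812, -81.3101), (50.83, -0.2423)]
table = frequency_table(points)
capped = frequency_table(points, cap=2)
print(table)
print(capped)
```

Rounding before counting is what turns individual dispatches into map cells. The number of digits sets the cell size, so a coarser rounding gives fewer, larger tiles.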
Now for our visualization. We’ll be using the ggmap package to overlay our frequency table on a map of Orlando (sourced from Google Maps).
We can see that the darkest areas are downtown, along E & W Colonial Dr, and around shopping areas like the Millenia Mall. All of these areas are either highly populated or highly trafficked during the day. There are also a couple of hot spots around intersections, which are likely due to accidents.
I’d like to break it down into just violent and non-violent.
First, the locations with the greatest frequency of violent crime are:
These areas are centered around nightlife or are in low income neighborhoods. The outlier here is the Florida Hospital. I believe this is a similar situation to the police station where reports are filed at the hospital because the victim has already been rushed away from the scene for care.
Now, the locations with the greatest frequency of non-violent crime are:
With a few repeats, most of these areas are commercial shopping plazas. This makes sense because our non-violent crime category is dominated by types of theft including shoplifting. I do find it interesting that the Universal Studios employee parking lot was so red. It’s likely that the guest parking lots have more incidents, but they’re handled by park security.
I’d like to just look at our accident dispatches. I want to see what the least safe datetime is and the locations with the most accidents. Let’s start with accidents by hour.
We see that accidents also follow the 5 AM arc we saw earlier and have the 5 PM response time dip. We can also see that the number of accidents decreases just after morning rush hour.
The number of accidents is only about 75% as high on the weekends. Most accidents happen on Friday. I bet this is because of people going out or traveling on Friday evening, and we’ll see that with a faceted graph.
As I thought, we see an overall increase in accidents on Friday as people leave work. We get a second large spike in accidents at 4 PM as some people try to leave work early. As for the safest time, that would be at 4 AM on Thursday. We actually see the most early morning accidents on the weekend.
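Finding the quietest weekday and hour amounts to counting dispatches per (weekday, hour) pair and taking the minimum. The Python sketch below shows the idea on a few made-up accident datetimes. The function name `quietest_slot` is hypothetical, and the approach only ranks slots that appear at least once in the data.

```python
from collections import Counter
from datetime import datetime

def quietest_slot(stamps):
    """Return ((weekday, hour), count) for the least frequent observed slot.

    weekday follows datetime.weekday(): Monday is 0, Sunday is 6.
    Slots with zero entries never appear, so they are not considered.
    """
    counts = Counter()
    for stamp in stamps:
        parsed = datetime.strptime(stamp, '%Y-%m-%d %H:%M:%S')
        counts[(parsed.weekday(), parsed.hour)] += 1
    return min(counts.items(), key=lambda item: item[1])

# Made-up accident datetimes: three on Friday afternoons, one early Thursday
accidents = [
    '2014-06-06 16:05:00',
    '2014-06-06 16:40:00',
    '2014-06-13 16:10:00',
    '2014-06-05 04:15:00',
]
slot, n = quietest_slot(accidents)
print(slot, n)
```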
Now let’s see where the most accidents are.
## lat lon Freq
## 12826 28.494 -81.459 1823
## 18784 28.495 -81.436 1016
## 12815 28.483 -81.459 978
## 34342 28.513 -81.376 967
## 51404 28.481 -81.310 913
## 51447 28.524 -81.310 907
## 51420 28.497 -81.310 878
## 12794 28.462 -81.459 858
## 12308 28.494 -81.461 827
## 12846 28.514 -81.459 811
## 20333 28.490 -81.430 806
Comparing these coordinates with the heatmap, we can see that they match up with its darkest areas, and they’re all on top of road intersections.
Over the course of this analysis, I believe these plots best represent the information and findings in the dataset.
This first graph shows the number of dispatches divided by category through the course of the day. It generally shows how the level of police (and civilian) activity changes based on people’s sleep habits, work schedule (and the resulting rush hours), and public school day. This graph also allowed me to figure out that “police in traffic” is likely the cause of the value changes we see at 5 PM and 6 PM.
This heatmap shows the geospatial distribution of crime in the city. This is the kind of plot that best tells where additional patrols would be most useful. There are hotspots of crime around the expected places like downtown, shopping centers, and some low-income neighborhoods. However, there are other places of concern.
This faceted graph can best inform people when the safest time to drive is. On Fridays, for example, it might be safer to wait an extra hour at work than try to leave an hour early.
Using the same subset as the graph above, here are the most dangerous intersections in Orlando ranked by the number of accidents from May 2009 to August 2015:
As far as usable data goes, this dataset started as dispatch items with a datetime string, a reason string, and lat/lon coordinates. In other words, mostly categorical data. Attempting to use the data “as is” was not going to lead to any useful conclusions. I had to figure out ways to augment this database using the data available. Some of the new columns were created programmatically, like splitting the datetime, while some required a more “hands on” approach, like categorizing the dispatch reasons individually.
By far, I had the most trouble figuring out how to create the heatmaps. I started looking at ggplot2 with geom_map, but the smallest I could get was a blank polygon map of Florida’s counties. Then I looked at RgoogleMaps, but decided it wasn’t what I wanted. I tried (successfully) creating a heatmap by exporting the dataframe to Google Fusion Tables, but they don’t offer the color gradient overlay. It only put a solid dot at the geo-location of each, which didn’t adequately convey the data behind 1.45 million items.
Finally I found a library called ggmap which I could use with the ggplot additive layers. Specifically, using geom_tile with variable alpha levels, I was able to subset and round a dataframe to color the heatmap by location. The nice thing about ggmap is that it automatically restricts the bounds of the data displayed based on the bounds of the map. This meant I could change the zoom level without having to recreate the initial opdgeo dataframe.
I originally divided the datetime column into sections in Python but didn’t re-include it. I decided to rerun the script to add it back in after I realized that I wanted to use R to determine the day of the week, which required a POSIX-style datetime string, not the values themselves.
The most interesting part was redrawing some of the graphs after creating the reason categories. Seeing a jump of ‘oncall’ dispatches between 5 PM and 6 PM is what made me realize “the cops get stuck in traffic too” could be a valid explanation for the difference in the original plot.
One thing this analysis lacks is any sort of modeling. It could be possible to merge coordinate-based demographic data to model the number of dispatches for an area over a given period of time.
A look into the dataset shows that the Orlando police spend comparatively little time responding to actual crimes and making arrests. At least half of their time is spent either as a third-party for reporting an event or as a figure of authority to de-escalate a tense, non-criminal situation. While supposedly limited to Orlando, they often assist county and local police in smaller towns outside of the city’s official, twisted limits.
http://stackoverflow.com/questions/5234117/how-to-drop-columns-by-name-in-a-data-frame
http://stackoverflow.com/questions/11985799/converting-date-to-a-day-of-week-in-r
http://www.bjs.gov/index.cfm?ty=tp&tid=31
http://rstudio-pubs-static.s3.amazonaws.com/7433_4537ea5073dc4162950abb715f513469.html
http://www.r-bloggers.com/visualising-thefts-using-heatmaps-in-ggplot2/
https://gist.github.com/jmarhee/8530768
http://stats.stackexchange.com/questions/5007/how-can-i-change-the-title-of-a-legend-in-ggplot2