crime-nexus

Background

This project analyzes and visualizes crime data, clusters, and hotspots in Philadelphia, Pennsylvania. Using a daily automated pipeline, the app processes crime, weather, and census data to provide insights into crime distribution. The visualizations are designed to help public health officials, city planners, and curious citizens identify areas of concern and understand the factors influencing crime patterns.

The data is sourced from various public APIs and data sources, including the City of Philadelphia's crime data, NOAA's weather data, and the U.S. Census Bureau's demographic and geographic data.

Goal

The primary goal is to provide a comprehensive, interactive tool for visualizing crime data in Philadelphia. The app highlights crime hotspots and uses clustering to identify areas with similar characteristics based on crime, weather, and census data. This allows for more effective resource allocation and a deeper understanding of the relationships between environmental, demographic, and criminal factors.

The app also offers a data download feature, allowing users to access the raw data for further analysis. This is particularly useful for researchers, analysts, and city officials who need to work with the data outside of the application.

Methodology

This project uses a daily pipeline to ensure the data is always up-to-date. The pipeline is split into three main parts:

Data Retrieval and Merging

The pipeline automatically fetches raw crime, weather, and census data for Philadelphia. It then performs data cleaning, feature engineering, and merging to create a unified dataset. This process includes:

Data Collection: Fetching data automatically from the relevant APIs, then merging the results.
Spatial Joins: Mapping crimes to specific census tracts.
Data Standardization: Correcting data types and handling missing values.
Feature Creation: Calculating new features like population density.

Clustering/Hotspot Analysis

The merged dataset is fit with UMAP to reduce its dimensionality and then with HDBSCAN to cluster the data. This process involves:

Hyperparameter Optimization: Using a TPE (Tree-structured Parzen Estimator) algorithm to find the best clustering parameters, with the help of MLflow (I connect to my personal Databricks workspace, where the runs are actually executed and recorded).
Clustering: Applying the best parameters to group crime data points into distinct clusters. I then subset to the clusters with the highest points of association, since I could otherwise end up displaying over 100 clusters depending on the daily data. This is a bit of a cheat, but it is what I chose to avoid severe overplotting.
Hotspot Analysis: Identifying areas with statistically significant crime activity.
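
The cluster-subsetting step can be sketched as follows. This is only an assumption about what "highest points of association" means: here it is read as "the clusters with the most member points", and the function names are hypothetical. The one fixed convention is that HDBSCAN labels noise points as -1, so those are excluded.

```python
from collections import Counter

NOISE_LABEL = -1  # HDBSCAN assigns -1 to points it treats as noise


def largest_clusters(labels, top_k):
    """Return the top_k cluster labels ranked by number of member points."""
    counts = Counter(label for label in labels if label != NOISE_LABEL)
    return [label for label, _ in counts.most_common(top_k)]


def keep_rows_in_clusters(rows, labels, keep):
    """Keep only the rows whose cluster label is in `keep`."""
    keep_set = set(keep)
    return [row for row, label in zip(rows, labels) if label in keep_set]
```

Capping the number of displayed clusters this way trades completeness for readability, which matches the overplotting concern described above.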

Mapping

The processed data is then used to generate a rich, interactive map using the Folium library. The map includes several layers that can be toggled on and off:

Recent Crimes: One layer has individual markers for the most recent crime incidents. A supplementary layer aggregates the raw counts over larger regions, in case the user does not want to view all the individual recent crimes.
Clusters: One layer has boundaries for each crime cluster (to avoid plotting tens of thousands of individual crimes on the map), color-coded for easy identification. Another has a marker over each cluster boundary that lists some summary statistics for that cluster. Note that there will likely be days where crimes assigned to one cluster are distributed all over Philadelphia; that's just the nature of the data (otherwise, crime would be easy to understand!).
Hotspots: A layer coloring areas with red for high crime density, and blue for low crime density.

With these three main layers, one can view the most recent crimes and see how they overlap with pre-existing crime hotspots and clusters. The hotspots highlight statistically significant areas of high crime based on the raw coordinates, while the clusters also take into account other aspects, such as the weather and census data for when and where each crime occurred.
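
To make the idea of coordinate-based hotspots concrete, here is a simplified sketch. This README does not name the statistic used, so the sketch swaps in plain z-scores of per-cell crime counts as a stand-in; the grid cell size and all function names are assumptions. It also scores only cells that contain at least one crime.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def grid_cell(lon, lat, cell_size):
    """Map a coordinate to the integer index of its square grid cell."""
    return (math.floor(lon / cell_size), math.floor(lat / cell_size))


def cell_counts(points, cell_size):
    """Count crimes per grid cell; points are (lon, lat) pairs."""
    return Counter(grid_cell(lon, lat, cell_size) for lon, lat in points)


def cell_z_scores(counts):
    """Standardize counts: positive means denser than average, negative sparser."""
    values = list(counts.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return {cell: 0.0 for cell in counts}
    return {cell: (n - mu) / sigma for cell, n in counts.items()}
```

Cells with strongly positive scores would be drawn toward red and strongly negative ones toward blue. A proper hotspot statistic would also account for neighboring cells, which this sketch ignores.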

App Features

The app is hosted here: https://crime-nexus.streamlit.app/. A preview image of the app is at the top of this README.

The Streamlit app is divided into two tabs:

Map Viewer

This tab displays the main interactive map with all the layers and legends. It provides a visual overview of crime in the city, allowing users to zoom, pan, and toggle layers for better visibility. The map is designed to be responsive and renders across the full width of the browser (though you ideally want to open the map on a widescreen device; the legend will block the rest of the map on a narrower device like a phone). I also include some summary statistics for each cluster.

Data Downloader

This tab allows users to select a specific crime cluster and download the processed data for that cluster as a CSV file. This is useful for in-depth analysis and is a key feature for enabling data-driven insights. Before download, the one-hot encoding applied during modeling is reversed, so the columns are clean and human-readable. While I do include statistics for each cluster, these are very high level, and those interested are encouraged to take the data and do a more in-depth analysis. It would be impossible for me to automatically generate accurate, in-depth analysis, since the data structure changes on a daily basis.
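
For readers unfamiliar with reversing a one-hot encoding, a minimal sketch follows. It assumes the encoded columns are named `<prefix>_<value>` (for example `text_general_code_Theft`), which is a common convention but not confirmed by this README; the function name is hypothetical.

```python
def reverse_one_hot(row, prefix, sep="_"):
    """Collapse columns like 'prefix_Theft': 1 into a single 'prefix': 'Theft'."""
    marker = prefix + sep
    # Carry over every column that is not part of the encoded group.
    collapsed = {k: v for k, v in row.items() if not k.startswith(marker)}
    # Find which encoded column is switched on.
    active = [k[len(marker):] for k, v in row.items()
              if k.startswith(marker) and v == 1]
    # Leave the value empty if the row is ambiguous (zero or several flags set).
    collapsed[prefix] = active[0] if len(active) == 1 else None
    return collapsed
```

For example, a row with `text_general_code_Theft` set to 1 and `text_general_code_Assault` set to 0 collapses to a single `text_general_code` column holding `"Theft"`.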

Repo Structure

The repo is split into folders by major component. The /experimental directory holds notebooks that walk through each major component, for anyone who would like more detail on my exact approach. Admittedly, I had to make a few changes over the course of this project, so the notebooks differ slightly from the actual scripts, but they should still be accurate enough for their main purpose. The .github/workflows directory holds the .yml file for my GitHub Actions run. /app holds the files for my Streamlit app. Finally, the /src directory holds the pipeline script, alongside a configuration file and a utilities file.

Future Work

There are several ways to enhance what this project can show. However, I am using free services (e.g., the free tier of GitHub Actions and Streamlit Community Cloud, alongside "free" versions of tokens), which leads to several limitations. For example, I could ingest data for several more years, but NOAA has a token limit that prevents me from doing so. I could also feed more data into the clustering and mapping parts of the pipeline, but the free version of GitHub Actions has a hard time limit of 6h, and the pipeline would exceed it. Finally, Streamlit Community Cloud has resource limits (there were times when the data for a particular day was too large, which led to instability and app crashes). All in all, I had to limit the amount of data to make every step of the project run smoothly. This is also why I do my hotspot analysis in the pipeline, since Streamlit will crash if I try to do the analysis in the app directory. This does mean, however, that these would be easy improvements if I were willing to pay for more resources.

Additionally, I did consider making a crime prediction model, which would be similar to the hotspot analysis but would also consider factors such as the number of bars in a grid cell. However, there are several ethical concerns with crime prediction. In particular, a model would just learn the pattern of where crime is reported, not where crime is actually happening. Crime is likely underreported in some areas compared to others, and a prediction model would learn that pattern, thereby reinforcing existing discrimination. There are several resources to explore on this topic; I found this one on LA policing decisions to be a good starting point: https://vce.usc.edu/volume-5-issue-3/pitfalls-of-predictive-policing-an-ethical-analysis/.

One thing I would like to do, though, is improve the cluster outlines on the map. The current strategy is pretty effective, but on many (if not most) days, at least one or two outlines extend beyond the boundaries of Philadelphia (e.g., when at least one crime in West Philadelphia is in the same cluster as one in North Philadelphia). The outline then draws a long diagonal between both parts of the city, which can be misleading, since users may think that crimes were recorded outside of Philadelphia. This is a minor QOL change I hope to implement later.

To conclude the point on crime prediction: while this project, to some degree, falls into the same pitfall, I am only displaying the crimes that were recorded, not trying to predict crimes. Many of these concerns apply to crime data itself, so the map should still be viewed with some level of scrutiny.

The codebase is structured to be modular and extensible, serving as a solid foundation for these future enhancements, if I ever decide to go forward with these suggestions.

(Rough) Data Dictionaries

Philadelphia Crime

Column	Data Type	Description
dc_dist	string	The police district where the incident occurred.
psa	string	The Police Service Area (PSA), a smaller geographic subdivision of a district.
dispatch_date	string	The date the call for service was dispatched (YYYY-MM-DD).
dispatch_time	string	The time the call for service was dispatched (HH:MI:SS).
hour	integer	The hour of the day the call was dispatched (0–23).
text_general_code	string	The text description for the type of incident (e.g., "Theft", "Assault").
location_block	string	The street address of the incident, anonymized to the block level.
point_y	float	The latitude coordinate for the incident location.
point_x	float	The longitude coordinate for the incident location.
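
Since `dispatch_date` and `dispatch_time` are stored as strings, a downstream user will often want a single timestamp. The sketch below assumes the formats listed in the table (YYYY-MM-DD and a 24-hour HH:MM:SS time); the function name and the sample values are made up for illustration.

```python
from datetime import datetime


def dispatch_timestamp(dispatch_date: str, dispatch_time: str) -> datetime:
    """Combine the two string columns into one datetime object."""
    return datetime.strptime(
        f"{dispatch_date} {dispatch_time}", "%Y-%m-%d %H:%M:%S"
    )


# Made-up sample values, not taken from the real dataset.
ts = dispatch_timestamp("2024-03-05", "14:07:30")
hour = ts.hour  # should agree with the separate `hour` column
```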

Weather

Column	Data Type	Description
date_dt	string	The date of the weather observation.
avg_wind_speed_mph	float	Average Wind Speed (miles per hour).
precipitation_inches	float	Precipitation (inches).
snowfall_inches	float	Snowfall (inches).
snow_depth_inches	float	Snow Depth (inches).
max_temp_f	float	Maximum Temperature (°F).
min_temp_f	float	Minimum Temperature (°F).

Census

Column	Data Type	Description
tract_fips	string	The FIPS code for the census tract.
pop_total	float	Total Population.
income_median	integer	Median Household Income.
median_age	float	Median Age.
poverty_rate	float	Percentage of people below the poverty line.
vacancy_rate	float	Percentage of vacant housing units.
renter_occupancy_rate	float	Percentage of renter-occupied housing units.
