This project analyzes and visualizes crime data, clusters, and hotspots in Philadelphia, Pennsylvania. Using a daily automated pipeline, the app processes crime, weather, and census data to provide insights into crime distribution. The visualizations are designed to help public health officials and city planners, or curious citizens, to identify areas of concern and understand the factors influencing crime patterns.
The data is sourced from various public APIs and data sources, including the City of Philadelphia's crime data, NOAA's weather data, and the U.S. Census Bureau's demographic and geographic data.
The primary goal is to provide a comprehensive, interactive tool for visualizing crime data in Philadelphia. The app highlights crime hotspots and uses clustering to identify areas with similar characteristics based on crime, weather, and census data. This allows for more effective resource allocation and a deeper understanding of the relationships between environmental, demographic, and criminal factors.
The app also offers a data download feature, allowing users to access the raw data for further analysis. This is particularly useful for researchers, analysts, and city officials who need to work with the data outside of the application.
This project uses a daily pipeline to ensure the data is always up-to-date. The pipeline is split into three main parts:
The pipeline automatically fetches raw crime, weather, and census data for Philadelphia. It then performs data cleaning, feature engineering, and merging to create a unified dataset. This process includes:
- Data Collecting: Collecting data automatically with relevant APIs, and then merging the results.
- Spatial Joins: Mapping crimes to specific census tracts.
- Data Standardization: Correcting data types and handling missing values.
- Feature Creation: Calculating new features like population density.
The merged dataset is fit with UMAP and HDBSCAN to both reduce the dimensionality of the data, and then cluster the data. This process involves:
- Hyperparameter Optimization: Using a TPE algorithm to find the best clustering parameters with the help of MLFlow (I connect with my personal DataBricks workspace, where the runs are actually executed and recorded).
- Clustering: Applying the best parameters to group crime data points into distinct clusters. I then subset to the clusters with the highest points of association (since I could end up displaying over 100 clusters depending on the daily data - This is a little cheating though, but it's what I chose to avoid severe overplotting).
- Hotspot Analysis: Identifying areas with statistically significant crime activity.
The processed data is then used to generate a rich, interactive map using the Folium library. The map includes several layers that can be toggled on and off:
- Recent Crimes: One layer has individual markers for the most recent crime incidents. There is also another supplementary layer to aggregate the raw counts in larger regions, in case the user does not want to view all the individual recent crimes.
- Clusters: One layer has boundaries for each crime cluster (to avoid having tens of thousands of individual crime plotted on the map), color-coded for easy identification. Another has a marker hovering over each cluster boundary to list some summary statistics with that cluster. Note that there will likely be days where crimes assigned to one cluster are distributed all over Philadephia - that's just the nature of the data (otherwise, crime would be easy to understand!).
- Hotspots: A layer coloring areas with red for high crime density, and blue for low crime density.
With these 3 main layers, one can view the most recent crime and see how it overlaps with pre-existing crime hotspots and clusters. The hotspots help to highlight statistically significant areas of high crime based on the raw coordinates, while the clusters take into account other aspects such as weather and census data of when/where the crime occured.
The app is hosted here: https://crime-nexus.streamlit.app/. A preview image of the app is at the top of this README.
The Streamlit app is divided into two tabs:
This tab displays the main interactive map with all the layers and legends. It provides a visual overview of crime in the city, allowing users to zoom, pan, and toggle layers for better visibility. The map is designed to be responsive and renders across the full width of the browser (though you ideally want to open the map on a widescreen device; The legend will block the rest of the map on a thinner device like a phone). I also include some summary statistics for each cluster.
This tab allows users to select a specific crime cluster and download the raw, processed data for that cluster as a CSV file. This is useful for in-depth analysis and is a key feature for enabling data-driven insights. The data is pre-processed to reverse one-hot-encoded columns, providing clean and human-readable data to undo the steps that were done during modeling. While I do include statistics for each cluster, this is on a very high level, and those interested are encouraged to take the data and do more in depth analysis. It would be impossible for me to automatically generate accurate, in depth analysis, since the data structure will change on a daily basis.
The repo is split up into folders based on each major component. In particular, the /experimental directory holds notebooks to walkthrough each major component, if one would like to get more details with my exact approach. Admittedly, there were a few changes I had to make throughout this projects, so the notebooks are slightly different than the actual scripts, but it should still be accurate enough for it's main purpose. The .github/workflows directory holds the .yml file for my GitHub Actions run. /app holds files for my Streamlit app. Finally, the /src directory holds the pipeline script, alongside a configuration and utilities file.
This project has several paths to enhance what it can show. However, given I am using free services (e.g., the free tier of GitHub Actions, Streamlit Community Cloud, alongside "free" versions of tokens), this leads to several limitations. For example, I could ingest data for several more years, but the NOAA has a token limit that prevents me from doing so. Not only that, but I could ingest more data into the clustering part of the pipeline, as well as the mapping. However, there is a hard time limit of 6h for the free version of GitHub Actions, and the pipeline would exceed that time limit. Finally, Streamlit Community Cloud has resource limits (which lead to times where data for a particular data was too large, and led to instability and app crashes), so all in all, I had to limit the amount of data to make the whole project run smoothly in each step. This is also why I have to do my hotspot analysis in the pipeline, since Streamlit will crash if I try to do the analysis in the app directory. BUT, this does mean that these are easy improvements if I were willing to pay for more resources.
Additionally, I did consider making a crime prediction model, which would be similar to the hotspot analysis, but also considering factors such as the number of bars in a grid cell, and so on. But, there are several ethical concerns with crime prediction. In particular, a model would just learn the pattern of where crime is reported, and not where crime is actually happening. Crime is likely underreported in some areas compared to others, and a prediction model would just learn that pattern, thereby enhancing existing discrimination. There are several resources to explore on this topic; I found this one for LA policing decisions to be a good starting point: https://vce.usc.edu/volume-5-issue-3/pitfalls-of-predictive-policing-an-ethical-analysis/.
Though, one thing I would like to do is improve the cluster outlines on the map. The current strategy is pretty effective, but many, if not most/all days, will end up with at least 1-2 outlines extended beyond the boundaries of Philadelphia (e.g. at least one crime in West Philaldephia is in the same cluster as one in North Philadelphia), which can be misleading since users may think that there are crimes recorded that are outside of Philadelphia, since there will be a long diagonal between both parts of Philadelphia. This is a minor QOL change I hope to implement later.
To conclude this point, while this project is to some degree, also falling down this pitfall, I am only displaying the crimes that were recorded, not trying to predict crimes. Many of these points are with crime themselves, but the map should be left with some level of scrutiny.
The codebase is structured to be modular and extensible, serving as a solid foundation for these future enhancements, if I ever decide to go forward with these suggestions.
| Column Name | Data Type | Description |
|---|---|---|
| dc_dist | string | The police district where the incident occurred. |
| psa | string | The Police Service Area (PSA), a smaller geographic subdivision of a district. |
| dispatch_date | string | The date the call for service was dispatched (YYYY-MM-DD). |
| dispatch_time | string | The time the call for service was dispatched (HH:MI:SS). |
| hour | integer | The hour of the day the call was dispatched (0–23). |
| text_general_code | string | The text description for the type of incident (e.g., "Theft", "Assault"). |
| location_block | string | The street address of the incident, anonymized to the block level. |
| point_y | float | The latitude coordinate for the incident location. |
| point_x | float | The longitude coordinate for the incident location. |
| Column | Data Type | Description |
|---|---|---|
| date_dt | string | The date of the weather observation. |
| avg_wind_speed_mph | float | Average Wind Speed (miles per hour). |
| precipitation_inches | float | Precipitation (inches). |
| snowfall_inches | float | Snowfall (inches). |
| snow_depth_inches | float | Snow Depth (inches). |
| max_temp_f | float | Maximum Temperature (°F). |
| min_temp_f | float | Minimum Temperature (°F). |
| Column | Data Type | Description |
|---|---|---|
| tract_fips | string | The FIPS code for the census tract. |
| pop_total | float | Total Population. |
| income_median | integer | Median Household Income. |
| median_age | float | Median Age. |
| poverty_rate | float | Percentage of people below the poverty line. |
| vacancy_rate | float | Percentage of vacant housing units. |
| renter_occupancy_rate | float | Percentage of renter-occupied housing units. |