- Introduction
- About The Project
- Clean the data
- PartI: Create a New Relational Data Model
- PartII: SQL Queries
- PartIII: Insights
Name: Chi-Ying(Rebekah)Chuang
School: University of Illinois at Urbana-Champaign
Degree: Master of Science
Program: Information Management
Expected Graduation: May 2024
The following csv files are the 4 given original sample data files:
Path: Fetch_DA_2023_intern/Original csv files/
The goal of this project is to:
-
Review CSV data and diagram a new structured relational data model
-
Generate a query that answers a predetermined business question(s)
-
Choose something noteworthy to share with a non-technical stakeholder
Skills: Python(NumPy, Pandas), SQL, Tableau
First, I cleaned the data using Python(NumPy, Pandas). I dropped useless columns, converted datatype, and saved them into new files. For more details about how I clean the data, please see the attached Jupyter Notebook.
Path: Fetch_DA_2023_intern/Clean Data.ipynb
When saving new files, I add a suffix _2
to differentiate them from the original files. However, I changed the file name to the original one when importing csv files into SQL Server to make it more convenient when running queries.
Path: Fetch_DA_2023_intern/New csv files/
The next step is diagraming a new structured relational data model. After cleaning the data, I created a new ER diagram(Entity-Relationship Diagram) using Lucidchart for this database(shown below).
PK stands for Primary Key
FK stands for Foreign Key
After creating a new relational data model, I need to write queries to answer questions from business stakeholders. The SQL dialect I chose was MS SQL Server.
Questions 1-5 are those given by Fetch Rewards, and Questions 6-10 are those I created that might be helpful. For the original files of SQL Queries, please see attached file, but I will also leave my code and notes below!
Path: Fetch_DA_2023_intern/SQLQuery.sql
WITH cte AS
(
SELECT
b.BARCODE,
b.NAME AS brand_name,
ri.TOTAL_FINAL_PRICE,
r.PURCHASE_DATE,
MONTH(r.PURCHASE_DATE) AS month
FROM brands AS b
JOIN receipt_items AS ri
ON b.BARCODE = ri.BARCODE
JOIN receipts AS r
ON r.ID = ri.REWARDS_RECEIPT_ID
)
SELECT
TOP 1 brand_name,
SUM(TOTAL_FINAL_PRICE) AS price,
month
FROM cte
GROUP BY brand_name, month
HAVING MONTH = 6
ORDER BY price DESC;
WITH cte AS
(
SELECT
u.ID AS user_id,
MONTH(r.PURCHASE_DATE) AS month,
r.TOTAL_SPENT
FROM users AS u
JOIN receipts AS r
ON u.ID = r.USER_ID
)
SELECT
TOP 1 user_id,
month,
SUM(TOTAL_SPENT) AS total_spent
FROM cte
GROUP BY user_id, month
HAVING month = 8
ORDER BY total_spent DESC;
Note
Questions 3&4 can be answered using the same query. After looking at the output, I believe there might be some typo in the original file because the price of Starbucks Iced Coffee Premium Coffee Beverage Unsweetened Blonde Roast Bottle 48 Oz 1 Ct shouldn't be that high. However, since I'm not sure about the correct price and if there are any other typos, I decided to keep the original number.
WITH cte AS
(
SELECT
u.ID AS user_id,
r.ID AS receipt_id,
ri.DESCRIPTION AS item_name,
CAST(ROUND(ri.TOTAL_FINAL_PRICE, 2) AS DECIMAL(10,2)) AS total_final_price,
ri.QUANTITY_PURCHASED AS quantity_purchased,
CAST(ROUND(total_final_price/quantity_purchased, 2) AS FLOAT) AS item_price
FROM users AS u
JOIN receipts AS r
ON u.ID = r.USER_ID
JOIN receipt_items AS ri
ON r.ID = ri.REWARDS_RECEIPT_ID
WHERE (total_final_price IS NOT NULL)
AND (total_final_price != 0)
AND (quantity_purchased IS NOT NULL)
AND (quantity_purchased != 0)
)
SELECT
TOP 1 user_id,
item_name,
item_price
FROM cte
ORDER BY item_price DESC;
WITH cte AS
(
SELECT
u.ID AS user_id,
MONTH(r.DATE_SCANNED) AS month_scanned
FROM users AS u
JOIN receipts AS r
ON u.ID = r.USER_ID
)
SELECT
CASE WHEN month = 1 THEN 'JAN'
WHEN month = 2 THEN 'FEB'
WHEN month = 3 THEN 'MAR'
WHEN month = 4 THEN 'APR'
WHEN month = 5 THEN 'MAY'
WHEN month = 6 THEN 'JUN'
WHEN month = 7 THEN 'JUL'
WHEN month = 8 THEN 'AUG'
WHEN month = 9 THEN 'SEP'
WHEN month = 10 THEN 'OCT'
WHEN month = 11 THEN 'NOV'
ELSE 'DEC' END AS month,
user_id_count
FROM
(
SELECT
TOP 12 month_scanned AS month,
COUNT(user_id) AS user_id_count
FROM cte
GROUP BY month_scanned
ORDER BY month
)sub;
SELECT
TOP 10 COUNT(NAME) AS brand_count,
CATEGORY AS category
FROM brands
GROUP BY category
HAVING category IS NOT NULL
ORDER BY brand_count DESC;
SELECT
TOP 20 STORE_NAME AS store_name,
ROUND(SUM(TOTAL_SPENT), 3) AS revenue
FROM receipts
WHERE YEAR(PURCHASE_DATE) = 2022
GROUP BY store_name
ORDER BY revenue DESC;
SELECT
TOP 5 STATE AS state,
COUNT(*) AS count
FROM users
GROUP BY state
ORDER BY count DESC;
WITH cte AS
(
SELECT
*,
CASE WHEN age BETWEEN 0 AND 10 THEN '0-10'
WHEN age BETWEEN 11 AND 20 THEN '11-20'
WHEN age BETWEEN 21 AND 30 THEN '21-30'
WHEN age BETWEEN 31 AND 40 THEN '31-40'
WHEN age BETWEEN 41 AND 50 THEN '41-50'
WHEN age BETWEEN 51 AND 60 THEN '51-60'
WHEN age BETWEEN 61 AND 70 THEN '61-70'
WHEN age BETWEEN 71 AND 80 THEN '71-80'
END AS age_range
FROM
(
SELECT
*,
2023 - YEAR(BIRTH_DATE) AS age
FROM users
) sub
)
SELECT
age_range,
COUNT(*) AS age_range_count
FROM cte
GROUP BY age_range;
SELECT
SIGN_UP_SOURCE AS sign_up_source,
COUNT(*) AS source_count
FROM users
GROUP BY sign_up_source;
For the last part, I used Tableau to create visualizations and dashboards to help non-technical stakeholders easily understand my analysis.
The visualization below shows the distribution of registered users. From this visualization, we know that Florida State has the most registered users since the color of Florida is the darkest on the whole map.
User.Count.By.State.mov
From the highlight table below, we can know the number and percentage of the receipt that is FINISHED/FLAGGED/PENDING/REJECTED/SUBMITTED. For example, among all the rewards receipts, 91.15% of them are finished(which is 64,350). The darker color represents the higher percentage/number.
Rewards.Receipts.Status.mov
I created a treemap in the following dashboard to visualize the store's revenue(top 20) each year. Users can use the slider above the dashboard to filter the year they want. The bigger the area, the larger the revenue. Users can see the corresponding store name and revenue that year when they hover over the dashboard.
Store.s.Revenue.By.Year.2020-2023.mov
This visualization can be used to answer Question 6. From this treemap, we know that Beverages is the category with the most brand among all the categories. Users can see the category and brand count when they hover over the visualization.
Brand.Count.By.Category.mov
The dashboard below can be used to answer Question 5 and helps stakeholders visualize the number of users scanned in a specific year or quarter. Users can use the slider to filter quarters and the dropdown to filter years.