Data-driven Analysis for Receipt Scanning Application

Introduction

Name: Chi-Ying (Rebekah) Chuang

School: University of Illinois at Urbana-Champaign

Degree: Master of Science

Program: Information Management

Expected Graduation: May 2024

About The Project

The four original sample CSV files provided for this project are in the following folder:

Path: Fetch_DA_2023_intern/Original csv files/

The goals of this project are to:

Review the CSV data and diagram a new structured relational data model
Generate queries that answer predetermined business questions
Choose something noteworthy to share with a non-technical stakeholder

Skills: Python (NumPy, Pandas), SQL, Tableau

Clean the data

First, I cleaned the data using Python (NumPy, Pandas): I dropped columns that were not needed, converted data types, and saved the results to new files. For details on the cleaning steps, please see the attached Jupyter Notebook.

Path: Fetch_DA_2023_intern/Clean Data.ipynb

When saving the new files, I added the suffix _2 to differentiate them from the original files. However, I changed each file name back to the original one when importing the CSV files into SQL Server, which makes the queries more convenient to write.

Path: Fetch_DA_2023_intern/New csv files/
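
As a rough illustration of the cleaning steps described above (dropping columns, converting data types, saving under a _2 suffix), here is a minimal sketch. The notebook itself uses Pandas; this version swaps in Python's standard library so it runs without extra packages. The column names (`ID`, `TOTAL_SPENT`, `UNUSED_NOTE`), the sample rows, and the helper names are hypothetical and are not taken from the actual files.

```python
import csv
import io
from pathlib import Path


def clean_rows(rows, drop_columns, float_columns):
    """Drop unwanted columns and convert numeric strings to floats."""
    cleaned = []
    for row in rows:
        kept = {k: v for k, v in row.items() if k not in drop_columns}
        for col in float_columns:
            # Empty strings become None so they can load as NULL later.
            kept[col] = float(kept[col]) if kept.get(col) not in (None, "") else None
        cleaned.append(kept)
    return cleaned


def suffixed_name(filename, suffix="_2"):
    """Build the new file name, e.g. receipts.csv -> receipts_2.csv."""
    path = Path(filename)
    return f"{path.stem}{suffix}{path.suffix}"


# Hypothetical sample standing in for one of the original CSV files.
sample = io.StringIO("ID,TOTAL_SPENT,UNUSED_NOTE\nr1,12.50,x\nr2,,y\n")
rows = clean_rows(csv.DictReader(sample), {"UNUSED_NOTE"}, ["TOTAL_SPENT"])
print(rows)
print(suffixed_name("receipts.csv"))
```

Converting empty strings to None up front avoids type errors later, when a numeric column is imported into the database.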

Create a New Relational Data Model

The next step was diagramming a new structured relational data model. After cleaning the data, I created a new ER diagram (Entity-Relationship Diagram) for this database using Lucidchart (shown below).

PK stands for Primary Key

FK stands for Foreign Key
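
To make the PK/FK idea concrete, the sketch below declares the four tables in an in-memory SQLite database from Python. The table and column names come from the queries in this README, but which columns are keys is an assumption inferred from the JOIN conditions, and the column types are guesses. The actual diagram may differ, and the project itself uses SQL Server rather than SQLite.

```python
import sqlite3

# Assumed schema: keys are inferred from the JOINs used in the queries.
SCHEMA = """
CREATE TABLE users (
    ID TEXT PRIMARY KEY,
    STATE TEXT,
    BIRTH_DATE TEXT,
    SIGN_UP_SOURCE TEXT
);
CREATE TABLE brands (
    BARCODE TEXT PRIMARY KEY,
    NAME TEXT,
    CATEGORY TEXT
);
CREATE TABLE receipts (
    ID TEXT PRIMARY KEY,
    USER_ID TEXT REFERENCES users(ID),
    STORE_NAME TEXT,
    PURCHASE_DATE TEXT,
    DATE_SCANNED TEXT,
    TOTAL_SPENT REAL
);
CREATE TABLE receipt_items (
    REWARDS_RECEIPT_ID TEXT REFERENCES receipts(ID),
    BARCODE TEXT,
    DESCRIPTION TEXT,
    QUANTITY_PURCHASED REAL,
    TOTAL_FINAL_PRICE REAL
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript(SCHEMA)

conn.execute("INSERT INTO users (ID) VALUES ('u1')")
conn.execute("INSERT INTO receipts (ID, USER_ID) VALUES ('r1', 'u1')")

# A receipt pointing at a user that does not exist violates the foreign key.
try:
    conn.execute("INSERT INTO receipts (ID, USER_ID) VALUES ('r2', 'missing')")
    fk_enforced = False
except sqlite3.IntegrityError:
    fk_enforced = True
print(fk_enforced)
```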

SQL Queries

After creating the new relational data model, I wrote queries to answer questions from business stakeholders. The SQL dialect I chose was MS SQL Server.

Questions 1-5 were given by Fetch Rewards, and Questions 6-10 are ones I created that might be helpful. The original SQL queries are in the attached file, and my code and notes are also included below.

Path: Fetch_DA_2023_intern/SQLQuery.sql

Question1: Which brand saw the most dollars spent in the month of June?

WITH cte AS
(
   SELECT
       b.BARCODE,
       b.NAME AS brand_name,
       ri.TOTAL_FINAL_PRICE,
       r.PURCHASE_DATE,
       MONTH(r.PURCHASE_DATE) AS month
   FROM brands AS b
   JOIN receipt_items AS ri
   ON b.BARCODE = ri.BARCODE
   JOIN receipts AS r
   ON r.ID = ri.REWARDS_RECEIPT_ID
)

SELECT
   TOP 1 brand_name,
   SUM(TOTAL_FINAL_PRICE) AS price,
   month
FROM cte
WHERE month = 6
GROUP BY brand_name, month
ORDER BY price DESC;
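
One way to sanity-check the logic of this query is to run a port of it against a tiny hand-made dataset where the answer is obvious. The sketch below does that with Python's built-in sqlite3 module. The rows are invented for illustration, and the SQL is an SQLite adaptation (strftime instead of MONTH, LIMIT instead of TOP), not the SQL Server query itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE brands (BARCODE TEXT, NAME TEXT);
CREATE TABLE receipts (ID TEXT, PURCHASE_DATE TEXT);
CREATE TABLE receipt_items (REWARDS_RECEIPT_ID TEXT, BARCODE TEXT, TOTAL_FINAL_PRICE REAL);

INSERT INTO brands VALUES ('b1', 'Brand A'), ('b2', 'Brand B');
INSERT INTO receipts VALUES ('r1', '2022-06-03'), ('r2', '2022-06-20'), ('r3', '2022-07-01');
INSERT INTO receipt_items VALUES
    ('r1', 'b1', 5.0),
    ('r2', 'b2', 4.0),
    ('r2', 'b2', 3.0),
    ('r3', 'b1', 50.0);
""")

# SQLite port: strftime('%m', ...) replaces MONTH(), LIMIT replaces TOP.
top_brand = conn.execute("""
    SELECT b.NAME, SUM(ri.TOTAL_FINAL_PRICE) AS price
    FROM brands AS b
    JOIN receipt_items AS ri ON b.BARCODE = ri.BARCODE
    JOIN receipts AS r ON r.ID = ri.REWARDS_RECEIPT_ID
    WHERE strftime('%m', r.PURCHASE_DATE) = '06'
    GROUP BY b.NAME
    ORDER BY price DESC
    LIMIT 1
""").fetchone()
print(top_brand)
```

In this toy data, Brand A has the largest single purchase, but it falls in July, so the June filter should select Brand B.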

Question2: Which user spent the most money in the month of August?

WITH cte AS
(
    SELECT
        u.ID AS user_id,
        MONTH(r.PURCHASE_DATE) AS month,
        r.TOTAL_SPENT
    FROM users AS u
    JOIN receipts AS r
    ON u.ID = r.USER_ID
)

SELECT
    TOP 1 user_id,
    month,
    SUM(TOTAL_SPENT) AS total_spent
FROM cte
WHERE month = 8
GROUP BY user_id, month
ORDER BY total_spent DESC;

Question3: What user bought the most expensive item?

Question4: What is the name of the most expensive item purchased?

Note

Questions 3 and 4 can be answered with the same query. The output suggests a possible typo in the original file, because the price of Starbucks Iced Coffee Premium Coffee Beverage Unsweetened Blonde Roast Bottle 48 Oz 1 Ct seems implausibly high. However, since I don't know the correct price or whether there are other typos, I kept the original value.

WITH cte AS
(
    SELECT
        u.ID AS user_id,
        r.ID AS receipt_id,
        ri.DESCRIPTION AS item_name,
        CAST(ROUND(ri.TOTAL_FINAL_PRICE, 2) AS DECIMAL(10,2)) AS total_final_price,
        ri.QUANTITY_PURCHASED AS quantity_purchased,
        -- Use the source columns: an alias cannot be referenced in the SELECT or WHERE that defines it
        CAST(ROUND(ri.TOTAL_FINAL_PRICE / ri.QUANTITY_PURCHASED, 2) AS FLOAT) AS item_price
    FROM users AS u
    JOIN receipts AS r
    ON u.ID = r.USER_ID
    JOIN receipt_items AS ri
    ON r.ID = ri.REWARDS_RECEIPT_ID
    WHERE (ri.TOTAL_FINAL_PRICE IS NOT NULL)
    AND (ri.TOTAL_FINAL_PRICE != 0)
    AND (ri.QUANTITY_PURCHASED IS NOT NULL)
    AND (ri.QUANTITY_PURCHASED != 0)
)
SELECT
    TOP 1 user_id,
    item_name,
    item_price
FROM cte
ORDER BY item_price DESC;
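
The per-item price in this query is the line total divided by the quantity, after skipping rows where either value is missing or zero. A minimal Python sketch of the same arithmetic is below. The item rows and the `unit_price` helper are hypothetical, and Python's `round` may differ from SQL Server's ROUND in edge cases.

```python
def unit_price(total_final_price, quantity_purchased):
    """Return the price per unit rounded to 2 decimals, or None if undefined."""
    if not total_final_price or not quantity_purchased:
        return None  # covers both None and 0, mirroring the WHERE filters
    return round(total_final_price / quantity_purchased, 2)


# Hypothetical receipt items.
items = [
    {"user_id": "u1", "item_name": "Item A", "total": 9.00, "qty": 3},
    {"user_id": "u2", "item_name": "Item B", "total": 8.50, "qty": 1},
    {"user_id": "u3", "item_name": "Item C", "total": 20.0, "qty": 0},
    {"user_id": "u4", "item_name": "Item D", "total": None, "qty": 2},
]

priced = [(unit_price(i["total"], i["qty"]), i) for i in items]
priced = [(price, item) for price, item in priced if price is not None]
best_price, best_item = max(priced, key=lambda pair: pair[0])
print(best_item["user_id"], best_item["item_name"], best_price)
```

Item C has the largest total, but its quantity of 0 makes a unit price undefined, so it is excluded rather than treated as the most expensive item.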

Question5: How many users scanned in each month?

WITH cte AS
(
    SELECT
        u.ID AS user_id,
        MONTH(r.DATE_SCANNED) AS month_scanned
    FROM users AS u
    JOIN receipts AS r
    ON u.ID = r.USER_ID
)

SELECT
    CASE WHEN month = 1 THEN 'JAN'
    WHEN month = 2 THEN 'FEB'
    WHEN month = 3 THEN 'MAR'
    WHEN month = 4 THEN 'APR'
    WHEN month = 5 THEN 'MAY'
    WHEN month = 6 THEN 'JUN'
    WHEN month = 7 THEN 'JUL'
    WHEN month = 8 THEN 'AUG'
    WHEN month = 9 THEN 'SEP'
    WHEN month = 10 THEN 'OCT'
    WHEN month = 11 THEN 'NOV'
    WHEN month = 12 THEN 'DEC' END AS month,
    user_id_count
FROM
(
    SELECT
        month_scanned AS month,
        -- DISTINCT counts each user once per month instead of once per receipt
        COUNT(DISTINCT user_id) AS user_id_count
    FROM cte
    GROUP BY month_scanned
)sub
-- Sort in the outer query; ordering inside a subquery is not guaranteed to carry over
ORDER BY sub.month;
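
For this question it matters whether the count is per scanned receipt or per distinct user, since one user can scan several receipts in the same month. The sketch below shows both counts side by side on a few invented scan records; the data is hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical (user_id, date_scanned) pairs; u1 scans twice in January.
scans = [
    ("u1", date(2022, 1, 5)),
    ("u1", date(2022, 1, 9)),
    ("u2", date(2022, 1, 20)),
    ("u1", date(2022, 2, 1)),
]

users_by_month = defaultdict(set)
receipts_by_month = defaultdict(int)
for user_id, scanned in scans:
    users_by_month[scanned.month].add(user_id)  # a set keeps each user once
    receipts_by_month[scanned.month] += 1       # every scan row counts

distinct_users = {m: len(u) for m, u in sorted(users_by_month.items())}
scan_rows = dict(sorted(receipts_by_month.items()))
print(distinct_users)
print(scan_rows)
```

In SQL terms, the first result corresponds to COUNT(DISTINCT user_id) and the second to COUNT(user_id).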

Question6: Which 10 categories have the most brands?

SELECT
    TOP 10 COUNT(NAME) AS brand_count,
    CATEGORY AS category
FROM brands
WHERE CATEGORY IS NOT NULL
GROUP BY CATEGORY
ORDER BY brand_count DESC;

Question7: How much did each store earn in 2022 (list top 20)?

SELECT
    TOP 20 STORE_NAME AS store_name,
    ROUND(SUM(TOTAL_SPENT), 3) AS revenue
FROM receipts
WHERE YEAR(PURCHASE_DATE) = 2022
GROUP BY store_name
ORDER BY revenue DESC;

Question8: What are the top 5 states that registered users come from?

SELECT
    TOP 5 STATE AS state,
    COUNT(*) AS count
FROM users
GROUP BY state
ORDER BY count DESC;

Question9: What is the age distribution of the registered users?

WITH cte AS
(
    SELECT
        *,
        CASE WHEN age BETWEEN 0 AND 10 THEN '0-10'
        WHEN age BETWEEN 11 AND 20 THEN '11-20'
        WHEN age BETWEEN 21 AND 30 THEN '21-30'
        WHEN age BETWEEN 31 AND 40 THEN '31-40'
        WHEN age BETWEEN 41 AND 50 THEN '41-50'
        WHEN age BETWEEN 51 AND 60 THEN '51-60'
        WHEN age BETWEEN 61 AND 70 THEN '61-70'
        WHEN age BETWEEN 71 AND 80 THEN '71-80'
        END AS age_range
    FROM
    (
        SELECT
            *,
            2023 - YEAR(BIRTH_DATE) AS age
        FROM users
    ) sub
)

SELECT
    age_range,
    COUNT(*) AS age_range_count
FROM cte
GROUP BY age_range;
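
The bucketing in this query can be sketched in Python as follows. It uses the same approximation (2023 minus birth year) and the same 0-10, 11-20, ..., 71-80 ranges. The `age_range` helper and the birth years are hypothetical. As in the CASE expression, ages outside the listed ranges get no label and end up grouped under a null bucket.

```python
from collections import Counter


def age_range(age, width=10, max_age=80):
    """Map an age to a label such as '21-30'; None if outside 0..max_age."""
    if age is None or age < 0 or age > max_age:
        return None
    if age <= width:
        return f"0-{width}"
    lower = ((age - 1) // width) * width + 1
    return f"{lower}-{lower + width - 1}"


# Hypothetical birth years; the age uses the same year-only approximation.
birth_years = [1990, 1995, 2003, 1940, 1930]
ages = [2023 - year for year in birth_years]
counts = Counter(age_range(age) for age in ages)
print(counts)
```

Because the age is computed from the year only, it can be off by one for users whose birthday has not yet occurred in the reference year.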

Question10: What is the distribution of the sign up source?

SELECT
    SIGN_UP_SOURCE AS sign_up_source,
    COUNT(*) AS source_count
FROM users
GROUP BY sign_up_source;

Insights

For the last part, I used Tableau to create visualizations and dashboards to help non-technical stakeholders easily understand my analysis.

1. Distribution of Registered Users (Map):

The visualization below shows the distribution of registered users by state. Florida has the most registered users, as shown by its color being the darkest on the map.

User.Count.By.State.mov

2. Rewards Receipts Status (Highlight table):

The highlight table below shows the number and percentage of receipts with each status: FINISHED, FLAGGED, PENDING, REJECTED, or SUBMITTED. For example, 91.15% of all rewards receipts (64,350) are finished. A darker color represents a higher percentage or number.

Rewards.Receipts.Status.mov
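
The percentages in a table like this are each status count divided by the total number of receipts. A minimal sketch of that calculation is below. The status list is invented, so the resulting figures are not the project's actual numbers.

```python
from collections import Counter

# Hypothetical receipt statuses, not the real data.
statuses = ["FINISHED"] * 7 + ["REJECTED"] * 2 + ["PENDING"]

counts = Counter(statuses)
total = sum(counts.values())
# Share of each status as a percentage of all receipts, rounded to 2 decimals.
shares = {status: round(100 * n / total, 2) for status, n in counts.items()}
print(shares)
```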

3. Top 20 Stores' Revenue By Year(2020-2023) (Treemap):

I created a treemap in the following dashboard to visualize the revenue of the top 20 stores each year. Users can filter by year with the slider above the dashboard. The bigger the area, the larger the revenue. Hovering over the dashboard shows the corresponding store name and its revenue for that year.

Store.s.Revenue.By.Year.2020-2023.mov

4. Brand Count(By Category) (Treemap):

This visualization can be used to answer Question 6. The treemap shows that Beverages is the category with the most brands. Hovering over the visualization shows the category and brand count.

Brand.Count.By.Category.mov

5. Scanned User(By Year/Quarter) (Bar chart):

The dashboard below can be used to answer Question 5 and helps stakeholders visualize the number of users who scanned receipts in a specific year or quarter. Users can filter quarters with the slider and years with the dropdown.

Scanned.User.By.Year.Quarter.mov
