Summary of the Project

This project applies data warehousing with Redshift and builds an ETL pipeline using Python. The ETL pipeline is responsible for the following tasks:

  1. Copy data from S3 files to staging tables in Redshift
  2. Insert data from these staging tables into the modelled fact and dimension tables in Redshift, which are suited for analysis

The data is for a demo startup called Sparkify, whose analysts would like to run queries on users' song-play activity. The data files contain song data and user activity log data.

Steps for running the ETL

  1. Drop and create tables before running the ETL

    python create_tables.py

  2. Run the ETL

    python etl.py

  3. You can check whether the data has been loaded correctly using the Redshift query editor in the AWS console
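The check in step 3 can also be scripted. The following is a minimal sketch, assuming a DB-API cursor such as the one psycopg2 returns. The helper name `count_rows` is hypothetical and not part of the project; the table names are the ones listed under Database Design.

```python
# Hypothetical helper (not part of the project): count the rows in each
# table to confirm that the load produced data.
TABLES = [
    "staging_events", "staging_songs",
    "songplays", "users", "songs", "artists", "time",
]


def count_rows(cur, tables=TABLES):
    """Return {table_name: row_count} using any DB-API cursor."""
    counts = {}
    for table in tables:
        # Table names come from the fixed list above, never from user input.
        cur.execute(f"SELECT COUNT(*) FROM {table}")
        counts[table] = cur.fetchone()[0]
    return counts
```

A count of zero for any table after `etl.py` has finished would suggest that the corresponding COPY or INSERT step did not run as expected.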

Project structure

  1. Data directories:

    • song_data: The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. The contents of one such file look like the following:

      {"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}

    • log_data: The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These simulate activity logs from a music streaming app based on specified configurations. The log files are partitioned by year and month. The following is a snapshot of log_data:

    Log Data preview

  2. create_tables.py - Script to drop and create tables according to the data model. This script must be executed before running the ETL

  3. sql_queries.py - A collection of all DDL and DML queries used in the project

  4. etl.py - The actual ETL script, which loads data from the data files into Redshift by first copying it to staging tables and then inserting it from the staging tables into the corresponding fact and dimension tables
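The two stages that etl.py performs could be organized roughly as follows. This is only a sketch, not the project's actual code: the function names and the `copy_queries` / `insert_queries` lists are assumptions standing in for whatever sql_queries.py exports, and `conn` is assumed to be an open DB-API connection (for example one created with psycopg2).

```python
# Hypothetical layout of the two ETL stages; names are illustrative only.
def load_staging_tables(cur, conn, copy_queries):
    """Stage 1: run each COPY statement to fill the staging tables."""
    for query in copy_queries:
        cur.execute(query)
        conn.commit()


def insert_tables(cur, conn, insert_queries):
    """Stage 2: run each INSERT ... SELECT to fill the fact and dimension tables."""
    for query in insert_queries:
        cur.execute(query)
        conn.commit()


def run_etl(conn, copy_queries, insert_queries):
    """Run both stages in order on an open DB-API connection."""
    cur = conn.cursor()
    load_staging_tables(cur, conn, copy_queries)
    insert_tables(cur, conn, insert_queries)
    cur.close()
```

Committing after each statement keeps completed loads in place if a later query fails, at the cost of leaving the warehouse partially loaded; a single commit at the end would be the alternative.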

Database Design

Staging tables:

We create 2 staging tables to copy data as is from AWS S3 buckets.

  1. staging_events - holds all user activity data copied from the log files in the 'log_data' directory of the S3 bucket located at

    s3://udacity-dend/log_data

  2. staging_songs - holds all song track data copied from the files at the following S3 location

    s3://udacity-dend/song_data
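Because the data is copied as is, and the COPY example under ETL process uses `json 'auto'` (which matches JSON keys to column names), the keys of a song file indicate which columns staging_songs needs. A small sketch using the sample record from the song_data description; the variable names are illustrative, and column types are not shown because the source does not state them.

```python
import json

# Sample record copied from the song_data description in this README.
SAMPLE_SONG = (
    '{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", '
    '"artist_latitude": null, "artist_longitude": null, '
    '"artist_location": "", "artist_name": "Line Renaud", '
    '"song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", '
    '"duration": 152.92036, "year": 0}'
)

record = json.loads(SAMPLE_SONG)

# With json 'auto', COPY maps each JSON key to the column of the same name,
# so these keys are the column names the staging table would need.
staging_songs_columns = list(record.keys())

# Keys whose value is null in this record; such columns must accept NULLs.
nullable_in_sample = [key for key, value in record.items() if value is None]
```

In this record the latitude and longitude are null and the location is an empty string, so the staging table should not put NOT NULL constraints on those columns.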

Data Modeling:

Using the song and log datasets, we create a star schema optimized for queries on song play analysis.

We modelled the data as 1 fact table and 4 corresponding dimension tables:

Fact Table:

  1. songplays - records in log data associated with song plays i.e. records with the page NextSong
    • songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent

Dimension Tables:

  1. users - users in the app
    • user_id, first_name, last_name, gender, level
  2. songs - songs in music database
    • song_id, title, artist_id, year, duration
  3. artists - artists in music database
    • artist_id, name, location, latitude, longitude
  4. time - timestamps of records in songplays broken down into specific units
    • start_time, hour, day, week, month, year, weekday
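To illustrate what "broken down into specific units" means for the time table, here is a sketch in Python. The project itself populates the table with SQL from the staging tables; this code only shows the breakdown. It assumes the log timestamps are epoch milliseconds in UTC and that weekday is numbered with Monday as 0, neither of which is stated in this README. The function name `time_row` is hypothetical.

```python
from datetime import datetime, timezone


def time_row(ts_ms):
    """Break an epoch-milliseconds timestamp into the time table's columns."""
    # Assumption: ts_ms is milliseconds since the Unix epoch, in UTC.
    start_time = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return {
        "start_time": start_time,
        "hour": start_time.hour,
        "day": start_time.day,
        "week": start_time.isocalendar()[1],  # ISO week number
        "month": start_time.month,
        "year": start_time.year,
        "weekday": start_time.weekday(),  # Monday == 0 (assumed convention)
    }
```

Storing these units once per distinct start_time lets analysts group song plays by hour or weekday without recomputing date parts in every query.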

ETL process

The ETL process is written in Python, uses Redshift as the data warehouse, and makes use of the 'psycopg2' library for connecting and writing to Redshift.

The ETL process is modelled mainly with two components/processors:

  1. Copy data from S3 files to staging tables using Redshift COPY command
    e.g.

     copy staging_songs
     from {SONG_DATA}
     iam_role {IAM_ROLE}
     region 'us-west-2'
     json 'auto';
  2. Insert data from these staging tables into the modelled fact and dimension tables in Redshift, which are suited for analysis
    e.g.

     INSERT INTO users (user_id, first_name, last_name, gender, level)
     SELECT user_id, first_name, last_name, gender, level
     FROM staging_events
     WHERE user_id IS NOT NULL;
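The `{SONG_DATA}` and `{IAM_ROLE}` placeholders in the COPY example have to be filled in before the statement is sent to Redshift. A minimal sketch of one way to do that with `str.format`; the helper name `build_copy` and the role ARN are hypothetical placeholders, and the real values would presumably come from the project's configuration.

```python
# Template mirroring the COPY example in this README.
STAGING_SONGS_COPY = """
    copy staging_songs
    from {SONG_DATA}
    iam_role {IAM_ROLE}
    region 'us-west-2'
    json 'auto';
"""


def build_copy(song_data, iam_role):
    """Fill the template, wrapping both values in single quotes for SQL."""
    return STAGING_SONGS_COPY.format(
        SONG_DATA=f"'{song_data}'",
        IAM_ROLE=f"'{iam_role}'",
    )


# The ARN below is a made-up placeholder, not a real role.
query = build_copy(
    "s3://udacity-dend/song_data",
    "arn:aws:iam::123456789012:role/example-redshift-role",
)
```

The resulting string can then be passed to `cur.execute(query)` on a cursor connected to the cluster.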