This project demonstrates the integration of Python with MySQL databases for data analysis workflows. The system illustrates the complete process of importing data from CSV files into a relational database, creating appropriate table structures, and performing increasingly sophisticated SQL queries to derive meaningful insights.
(Demo video: PythonMySQLCSV.webm)
Designed as both a learning tool and a practical reference implementation, this project showcases real-world data processing techniques that are commonly used in data analysis, business intelligence, and academic research settings.
This project serves multiple educational purposes:
- Database Integration: Demonstrates how Python can interact with MySQL databases using the PyMySQL library
- Data Processing Pipeline: Shows a complete ETL (Extract, Transform, Load) workflow from raw CSV data to structured database storage
- SQL Query Development: Provides incrementally challenging SQL exercises from basic CRUD operations to complex analytical queries
- Data Visualization: Illustrates how to present query results in a readable, formatted manner in the terminal environment
- Real-world Application: Models a simplified version of academic record management and analysis
By working through this project, you'll gain practical experience with:
- Establishing secure database connections with credential management
- Parsing and validating CSV data
- Creating robust database queries
- Implementing data categorization and grading systems
- Analyzing relative performance using gap analysis
- Formatting and presenting data analysis results
The project follows a modular architecture with a clear separation of concerns, making it easier to understand each component's responsibility and how the components interact to form a complete system.
- db_config.py:
- Manages database connection configuration and security
- Loads credentials from environment variables for better security practices
- Implements a connection factory pattern with error handling
- Creates the necessary database table structure with appropriate constraints
- Provides a clean connection closure mechanism
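As a rough sketch of what such a connection factory can look like (assuming PyMySQL and python-dotenv, with illustrative function and variable names rather than the exact contents of db_config.py):

```python
# Sketch of a PyMySQL connection factory driven by environment variables.
# Names and defaults are illustrative; see db_config.py for the real code.
import os

import pymysql
from dotenv import load_dotenv

load_dotenv()  # pull DB_* settings from the .env file into the environment

def get_connection():
    """Open a MySQL connection using credentials from environment variables."""
    try:
        return pymysql.connect(
            host=os.getenv("DB_HOST", "localhost"),
            port=int(os.getenv("DB_PORT", "3306")),
            user=os.getenv("DB_USER"),
            password=os.getenv("DB_PASSWORD"),
            database=os.getenv("DB_NAME"),
        )
    except pymysql.MySQLError as exc:
        raise RuntimeError(f"Could not connect to MySQL: {exc}") from exc
```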
- csv_importer.py:
- Handles the extraction and transformation of data from CSV sources
- Implements CSV parsing with validation checks
- Performs data cleaning and type conversion
- Executes SQL INSERT operations for database loading
- Includes detailed logging of the import process
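A minimal sketch of that import flow, assuming the CSV headers shown in the sample dataset later in this README (ID_NO, NAME, CGPA); the function name and messages are illustrative, not the repository's exact code:

```python
# Illustrative CSV-to-MySQL loader: validate each row, convert types, and
# insert with a parameterized query. Not the repository's exact implementation.
import csv

INSERT_SQL = "INSERT INTO student_records (id_no, name, cgpa) VALUES (%s, %s, %s)"

def import_students(connection, csv_path="data/students.csv"):
    imported = 0
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        with connection.cursor() as cursor:
            for row in reader:
                try:
                    cgpa = float(row["CGPA"])  # type conversion doubles as validation
                except (KeyError, ValueError):
                    print(f"Skipping invalid row: {row}")
                    continue
                cursor.execute(INSERT_SQL, (row["ID_NO"].strip(), row["NAME"].strip(), cgpa))
                imported += 1
    connection.commit()
    print(f"Imported {imported} records")
    return imported
```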
- data_analyzer.py:
- Contains the core analytical functions for student performance analysis
- Implements database retrieval functions with optimized queries
- Formats and displays data in human-readable tables
- Calculates summary statistics for student performance metrics
- Identifies top performers for recognition
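For orientation, a summary-statistics function in that style might be sketched as follows, using rich for the table output; the function name and table layout are assumptions:

```python
# Sketch of a CGPA summary display using rich; names are illustrative.
from rich.console import Console
from rich.table import Table

def show_cgpa_summary(connection):
    with connection.cursor() as cursor:
        cursor.execute(
            "SELECT MIN(cgpa), MAX(cgpa), ROUND(AVG(cgpa), 2) FROM student_records"
        )
        min_cgpa, max_cgpa, avg_cgpa = cursor.fetchone()

    table = Table(title="CGPA Summary")
    for heading in ("Min", "Max", "Average"):
        table.add_column(heading, justify="right")
    table.add_row(str(min_cgpa), str(max_cgpa), str(avg_cgpa))
    Console().print(table)
```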
- main.py:
- Orchestrates the entire data processing workflow
- Manages the execution sequence of the application
- Provides user feedback throughout the process
- Handles exception management and error reporting
- Serves as the entry point for the application
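The orchestration could plausibly reduce to something like the sketch below; the imported helper names follow the sketches above and are assumptions, not necessarily the modules' real API:

```python
# Hypothetical orchestration flow; actual function names may differ.
from db_config import get_connection            # assumed helper names
from csv_importer import import_students
from data_analyzer import show_cgpa_summary

def main():
    connection = None
    try:
        connection = get_connection()
        # In the real flow, the student_records table is also created
        # (see db_config.py) before the import runs.
        import_students(connection)
        show_cgpa_summary(connection)
    except Exception as exc:
        print(f"Error: {exc}")
    finally:
        if connection is not None:
            connection.close()

if __name__ == "__main__":
    main()
```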
- additional_queries_exercise.py:
- Contains structured SQL query exercises of increasing complexity
- Provides clear instructions and expected outcomes for each exercise
- Includes placeholder comments guiding implementation
- Features built-in visualization of query results
- Offers a practical platform for SQL skill development
- data/students.csv:
- Contains structured student record data featuring Star Wars character names
- Includes student identification, names, and CGPA values
- Serves as the raw data source for the database import process
- Provides a balanced distribution of grade values for meaningful analysis
Before running this project, you need:
- Python 3.10 or higher
- MySQL Server 8.0 or higher
- Access to create databases and tables on your MySQL server
- Basic understanding of SQL and Python
- Clone or download the repository:

  ```bash
  git clone https://github.com/prakash-aryan/python_mysql_csv_analyzer.git
  ```
- Create a virtual environment (recommended):

  ```bash
  python -m venv .venv
  source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  ```
- Install required dependencies:

  ```bash
  pip install -r requirements.txt
  ```
This will install:
- pymysql: For MySQL database connectivity
- python-dotenv: For secure environment variable management
- pandas: For data manipulation (optional, used in some advanced queries)
- rich: For terminal output formatting
- Database Configuration: Create and edit the `.env` file with your MySQL connection details:

  ```
  # MySQL Database Configuration
  DB_HOST=your_database_host   # e.g., localhost
  DB_PORT=3306                 # Default MySQL port
  DB_NAME=your_database_name   # Must exist on the server
  DB_USER=your_username        # Must have CREATE TABLE permissions
  DB_PASSWORD=your_password    # Store securely
  ```
- Database Preparation:
- Ensure your MySQL server is running
- Create the database specified in your `.env` file:

  ```sql
  CREATE DATABASE your_database_name;
  ```

- The application will automatically create the table structure when run
- Verify Setup: Check that:
- Your virtual environment is activated
- Your MySQL server is accessible with the provided credentials
- The specified database exists
- The data/students.csv file is present in the correct location
The project is designed to guide you through a progressive learning experience, starting with fundamental database operations and moving towards more complex analytical queries.
Start by running the main application to establish your database, import the data, and perform basic analysis:
```bash
python main.py
```
What this does:
- Database Connection:
- Establishes a secure connection to your MySQL server
- Outputs connection details including host and port
- Verifies database accessibility
- Table Creation:
- Checks if the `student_records` table exists
- Creates the table with appropriate column types and constraints if needed
- Confirms table structure is ready for data import
- Data Import Process:
- Opens and validates the CSV file structure
- Parses the header row and confirms expected columns
- Performs data type validation for each record
- Handles potential errors with appropriate messaging
- Truncates any existing data to ensure a fresh import
- Imports each record with SQL parameterization for security
- Provides a count of successfully imported records
- Data Analysis and Display:
- Retrieves all records with formatted SQL query
- Presents data in a well-structured table format
- Calculates key statistical measures (min, max, average CGPA)
- Identifies and highlights top-performing students
- Summarizes the overall class performance
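As an illustration of the last step, a "top performers" lookup might be a query of roughly this shape (the LIMIT of 3 is an arbitrary choice for the sketch, not necessarily what data_analyzer.py uses):

```python
# One plausible query behind the "top performers" step; LIMIT is an assumption.
TOP_PERFORMERS_SQL = """
SELECT id_no, name, cgpa
FROM student_records
ORDER BY cgpa DESC
LIMIT 3
"""
```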
Expected Output: You will see a detailed, step-by-step progress output with formatted tables showing:
- Connection and initialization status
- Import progress and confirmation
- Complete student records in tabular form
- Statistical analysis of the CGPA distribution
- Identification of top performers
This output allows you to verify the entire data pipeline is functioning correctly before proceeding to the exercises.
Once you've run the main application and verified that the data pipeline is functioning correctly, you can proceed to the SQL query exercises, which represent the core educational component of this project:
```bash
python additional_queries_exercise.py
```
The exercise file contains three progressively challenging query tasks, each designed to develop specific SQL skills and analytical thinking. Currently, these contain placeholder comments that you'll need to replace with working SQL statements.
Exercise 1: Performance Category Analysis

Learning Objectives:
- Implementing SQL CASE statements for data categorization
- Grouping and aggregating data with GROUP BY
- Applying multiple aggregate functions in a single query
- Sorting grouped results meaningfully
Task Description: Write a query that:
- Creates a performance category label using CGPA thresholds:
- Excellent: CGPA ≥ 3.8
- Very Good: CGPA ≥ 3.5
- Good: CGPA ≥ 3.3
- Satisfactory: CGPA ≥ 3.0
- Needs Improvement: CGPA < 3.0
- Groups students by these categories
- Calculates for each category:
- Student count
- Minimum CGPA
- Maximum CGPA
- Average CGPA
- Orders results by minimum CGPA in descending order
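One possible shape for this query is sketched below, as it might appear inside additional_queries_exercise.py; treat it as a starting point rather than the reference answer, and note that grouping by a SELECT alias is a MySQL convenience:

```python
# A possible solution sketch for Exercise 1, not the reference answer.
PERFORMANCE_CATEGORIES_SQL = """
SELECT
    CASE
        WHEN cgpa >= 3.8 THEN 'Excellent'
        WHEN cgpa >= 3.5 THEN 'Very Good'
        WHEN cgpa >= 3.3 THEN 'Good'
        WHEN cgpa >= 3.0 THEN 'Satisfactory'
        ELSE 'Needs Improvement'
    END AS performance_category,
    COUNT(*) AS student_count,
    MIN(cgpa) AS min_cgpa,
    MAX(cgpa) AS max_cgpa,
    ROUND(AVG(cgpa), 2) AS avg_cgpa
FROM student_records
GROUP BY performance_category   -- MySQL allows grouping by a SELECT alias
ORDER BY min_cgpa DESC
"""
```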
Visualization: The results will be displayed in a formatted table showing the distribution of students across performance categories with associated statistics.
Exercise 2: Letter Grade Assignment and Distribution

Learning Objectives:
- Creating complex CASE expressions with multiple conditions
- Implementing academic grading standards in SQL
- Constructing multiple related queries that share logic
- Calculating percentage distributions
Task Description: This exercise requires two related queries:
Query 1:
- Select student ID, name, and CGPA
- Create a letter grade column using this grading scale:
- A+: CGPA ≥ 3.9
- A: CGPA ≥ 3.7
- A-: CGPA ≥ 3.5
- B+: CGPA ≥ 3.3
- B: CGPA ≥ 3.0
- B-: CGPA ≥ 2.7
- C+: CGPA ≥ 2.3
- C: CGPA ≥ 2.0
- F: CGPA < 2.0
- Order results by CGPA in descending order
Query 2:
- Use the identical CASE logic to ensure consistency
- Count how many students received each letter grade
- Group the results by letter grade
- Order by minimum CGPA in descending order
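One way to keep the CASE logic identical across both queries is to define it once and reuse it, for example via a Python string as sketched below; this is an illustrative approach, not a required structure:

```python
# Sketch: share one CASE expression between the two Exercise 2 queries.
GRADE_CASE = """
    CASE
        WHEN cgpa >= 3.9 THEN 'A+'
        WHEN cgpa >= 3.7 THEN 'A'
        WHEN cgpa >= 3.5 THEN 'A-'
        WHEN cgpa >= 3.3 THEN 'B+'
        WHEN cgpa >= 3.0 THEN 'B'
        WHEN cgpa >= 2.7 THEN 'B-'
        WHEN cgpa >= 2.3 THEN 'C+'
        WHEN cgpa >= 2.0 THEN 'C'
        ELSE 'F'
    END
"""

# Query 1: per-student letter grades, highest CGPA first.
STUDENT_GRADES_SQL = f"""
SELECT id_no, name, cgpa, {GRADE_CASE} AS letter_grade
FROM student_records
ORDER BY cgpa DESC
"""

# Query 2: grade distribution, reusing the identical CASE logic.
GRADE_DISTRIBUTION_SQL = f"""
SELECT {GRADE_CASE} AS letter_grade,
       COUNT(*) AS student_count,
       MIN(cgpa) AS min_cgpa
FROM student_records
GROUP BY letter_grade
ORDER BY min_cgpa DESC
"""
```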
Visualization: Results will be displayed as two tables:
- Individual student grades showing ID, name, CGPA, and assigned letter grade
- Grade distribution summary showing counts and percentages for each grade
Exercise 3: CGPA Gap Analysis

Learning Objectives:
- Writing queries that enable sequential data analysis
- Understanding how to prepare data for gap calculations
- Implementing ranking and ordering in SQL
- Identifying significant performance thresholds
Task Description: Write a query that:
- Selects student ID, name, and CGPA
- Orders the results precisely by CGPA in descending order
- Provides the data needed for the Python code to calculate gaps between consecutive students
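The SQL itself stays simple here; the interesting work happens in Python afterwards. A minimal sketch of both halves, with illustrative names only:

```python
# Sketch: ordered query plus a Python pass that computes consecutive CGPA gaps.
RANKED_STUDENTS_SQL = """
SELECT id_no, name, cgpa
FROM student_records
ORDER BY cgpa DESC
"""

def compute_gaps(rows):
    """rows: sequence of (id_no, name, cgpa) tuples, already sorted by CGPA desc."""
    ranked = []
    for rank, (id_no, name, cgpa) in enumerate(rows, start=1):
        next_cgpa = rows[rank][2] if rank < len(rows) else None
        gap = round(cgpa - next_cgpa, 2) if next_cgpa is not None else None
        ranked.append((rank, id_no, name, cgpa, gap))
    return ranked
```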
Visualization: The results will be displayed as a ranked table showing:
- Student rank by CGPA
- Student ID and name
- CGPA value
- Gap to the next ranked student
- Highlighting of the largest performance gap
- Analysis of where the significant performance jumps occur
To successfully complete these exercises:
- Understand the Requirements:
- Carefully read each exercise description
- Note the specific columns needed in the result set
- Pay attention to ordering requirements
- Understand how the result will be used by the display code
- Implement the Queries:
- Open `additional_queries_exercise.py` in your preferred code editor
- Locate the placeholder comments for each query
- Write SQL statements that fulfill all requirements
- Ensure your column names match exactly what the display code expects
- Test Incrementally:
- Implement one query at a time
- Run the script after each implementation to test your solution
- Debug any errors by checking column names and syntax
- Verify Results:
- Check that your output tables look correct and well-formatted
- Verify that the data makes sense (e.g., grade assignments match CGPA values)
- Confirm that ordering and grouping work as expected
- Refine Your Solutions:
- Look for opportunities to improve query efficiency
- Ensure consistent CASE logic between related queries
- Consider edge cases in your data categorization
The `student_records` table is designed with a straightforward but effective schema optimized for academic record storage:
```sql
CREATE TABLE IF NOT EXISTS student_records (
    id_no VARCHAR(20) PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    cgpa FLOAT NOT NULL
)
```
Schema Design Rationale:
- `id_no` as VARCHAR(20): Allows for alphanumeric student IDs with prefix coding (e.g., S001)
- `name` as VARCHAR(100): Accommodates full names of various lengths with room for special characters
- `cgpa` as FLOAT: Stores cumulative grade point average with decimal precision
- PRIMARY KEY constraint: Ensures each student record is unique and optimizes lookup queries
- NOT NULL constraints: Maintain data integrity by preventing missing critical information
The system uses a curated dataset featuring Star Wars character names to make the learning experience more engaging:
| ID_NO | NAME | CGPA |
|-------|------|------|
| S001 | Luke Skywalker | 3.7 |
| S002 | Princess Leia | 3.9 |
| S003 | Han Solo | 3.5 |
| S004 | Rey Skywalker | 4.0 |
| S005 | Obi-Wan Kenobi | 3.2 |
| S006 | Padmé Amidala | 3.8 |
| S007 | Anakin Skywalker | 3.1 |
| S008 | Ahsoka Tano | 3.6 |
| S009 | Din Djarin | 3.3 |
| S010 | Grogu | 3.4 |
Dataset Characteristics:
- 10 student records providing sufficient data for meaningful analysis
- CGPA values ranging from 3.1 to 4.0, representing a realistic distribution
- Strategic distribution of values to demonstrate interesting patterns in analysis
- Balanced representation across different performance categories
- Sequential student IDs for easy reference
This data structure provides a foundation for practicing SQL queries ranging from simple selections to complex analytical grouping operations.
When implementing your solutions for the exercises, consider these SQL best practices and techniques:
- Effective CASE Statements:

  ```sql
  CASE
      WHEN condition1 THEN result1
      WHEN condition2 THEN result2
      ...
      ELSE default_result
  END AS column_name
  ```
- Use logical progression in condition order (often highest to lowest)
- Ensure conditions don't overlap to avoid unexpected results
- Include an ELSE clause to handle all possible values
- Give meaningful aliases to derived columns
- Aggregate Functions:
- `COUNT(*)`: Count all rows in a group
- `MIN(column)`: Find the minimum value
- `MAX(column)`: Find the maximum value
- `AVG(column)`: Calculate the average value
- `SUM(column)`: Total the values in a column
- Consider using `ROUND(AVG(column), 2)` for cleaner decimal display
- Grouping and Sorting:
- `GROUP BY`: Must include all non-aggregated columns in SELECT
- `ORDER BY`: Can reference columns by position number (e.g., `ORDER BY 1 DESC`)
- Multiple grouping: `GROUP BY column1, column2`
- Multiple sorting: `ORDER BY column1 DESC, column2 ASC`
- Readability Practices:
- Use consistent indentation for nested statements
- Place each major clause (SELECT, FROM, WHERE, etc.) on a new line
- Use table aliases for complex queries (`FROM table_name AS t`)
- Comment complex logic, particularly in CASE statements
- Common Pitfalls to Avoid:
- Forgetting to include columns in GROUP BY that appear in SELECT
- Using ambiguous column names when they could appear in multiple tables
- Missing edge cases in CASE statement conditions
- Not handling NULL values appropriately
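On the last point: cgpa is declared NOT NULL in this schema, so it cannot be NULL here, but in tables where the column is nullable, a CASE expression without an explicit check silently routes NULL rows into the ELSE branch. A hypothetical guard looks like this:

```python
# Hypothetical example: handle NULL explicitly before the range checks.
NULL_SAFE_STATUS_SQL = """
SELECT id_no,
       CASE
           WHEN cgpa IS NULL THEN 'No grade recorded'
           WHEN cgpa >= 3.0 THEN 'Pass'
           ELSE 'Review'
       END AS status
FROM student_records
"""
```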