Introduction: This report presents exploratory research on the Los Angeles Payroll Dataset, carried out to better understand statistical methods and to apply them to real-life questions that the data raises. Hypothesis tests based on the normal distribution, Student’s t-test, and ANOVA were performed on several features of the dataset to evaluate the assumptions stated as null hypotheses. The report also presents the work done to predict the Annual Salary and the Average Health Cost using linear regression and random forest (an ensemble of decision trees) algorithms.
Dataset: Los Angeles payroll data for four consecutive years, 2013 to 2016, under the Finance/Business Division was used for the statistical analysis. The dataset is largely self-explanatory: it gives insight into the payroll of employees in different departments, segmented by job title, over the four years. It was selected because it raises many questions that can be explored with statistical methods, as this project does. The dataset records the employment type, the hourly or event rate of each employee and their projected salary, the base pay, quarterly pay, and annual pay of each individual, and the bonuses, overtime pay, and benefits they receive. It also records the job class and pay grade of the employees.
Technologies Used: Python 3.5: Used to perform all the hypothesis tests; the tests were implemented and run in Anaconda’s Jupyter Notebook. OpenRefine: Before working with the dataset, the first step was to explore it and fully understand its contents. After that, we cleansed the data in OpenRefine by removing null values and outliers and formatting the dataset.
Questions Explored:
- Using hypothesis testing with the normal distribution, determine whether the Annual Salaries of the employees increased in financial year 2016 compared to 2015.
- Using hypothesis testing with Student’s t-distribution, check whether the Average Health Cost of the employees increased over the one-year period 2014-2015.
- Using ANOVA, determine whether the mean Base Salary of the Electricians remained the same across the three consecutive years 2014, 2015, and 2016.
- Can a linear regressor predict the Average Benefit Cost based on annual and quarterly payments?
- Can a random forest regressor, which is built from decision trees, predict the Annual Salary based on the job title?
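
To make the first and third questions concrete, the following is a minimal sketch of a one-sided two-sample z-test and a one-way ANOVA F statistic. It is not the code used in this project: the function names `one_sided_z_test` and `anova_f_statistic` are hypothetical, the salary figures are synthetic values generated only for illustration, and the sketch assumes Python 3.8 or later (for `statistics.NormalDist`) rather than the Python 3.5 environment described above.

```python
import math
import random
from statistics import NormalDist, mean, variance


def one_sided_z_test(sample_new, sample_old):
    """Two-sample z-test of H0: mean(new) <= mean(old) against H1: mean(new) > mean(old).

    Returns the z statistic and the one-sided p-value.
    """
    # Standard error of the difference in means, using sample variances.
    se = math.sqrt(variance(sample_new) / len(sample_new)
                   + variance(sample_old) / len(sample_old))
    z = (mean(sample_new) - mean(sample_old)) / se
    # Upper-tail probability under the standard normal distribution.
    p_value = 1.0 - NormalDist().cdf(z)
    return z, p_value


def anova_f_statistic(*groups):
    """One-way ANOVA: returns (F, df_between, df_within) for the given groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum(sum((x - mean(g)) ** 2 for x in g) for g in groups)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Synthetic stand-ins for the payroll columns (NOT values from the dataset).
rng = random.Random(0)
salaries_2015 = [rng.gauss(70000, 15000) for _ in range(500)]
salaries_2016 = [rng.gauss(73000, 15000) for _ in range(500)]

z, p = one_sided_z_test(salaries_2016, salaries_2015)
print("z = %.3f, one-sided p-value = %.4f" % (z, p))

base_2014 = [rng.gauss(80000, 5000) for _ in range(60)]
base_2015 = [rng.gauss(80500, 5000) for _ in range(60)]
base_2016 = [rng.gauss(81000, 5000) for _ in range(60)]
f_stat, df_b, df_w = anova_f_statistic(base_2014, base_2015, base_2016)
print("F = %.3f with (%d, %d) degrees of freedom" % (f_stat, df_b, df_w))
```

The null hypothesis of the z-test is rejected when the p-value falls below the chosen significance level (for example 0.05). The F statistic must be compared against a critical value of the F distribution with the reported degrees of freedom; a library routine such as `scipy.stats.f_oneway` returns the corresponding p-value directly.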