timothyLeeXQ/Prediction-and-Inference-with-F1-Data

Inferential and Predictive Analytics with Formula 1 Data

About Formula 1

Formula 1, or F1, "is the highest class of single-seater auto racing sanctioned by the Fédération Internationale de l'Automobile (FIA)" (Source: Wikipedia). F1 is a global sport: championship races have been held on every continent except Antarctica.

In F1, each constructor builds two race cars and engages drivers to race them. Drivers first compete in a qualifying round for their starting position, with the fastest driver starting at the front. In the race itself, drivers try to improve their finishing position by overtaking other cars. Race length depends on the circuit; modern races typically run between about 44 and 78 laps.

About the project

This project is the final project submission for GR5069: Applied Data Science, offered by the Quantitative Methods in the Social Sciences (QMSS) program at Columbia University. The project uses an F1 dataset to solve one inferential and one predictive problem. The full data was provided by the course instructors through an Amazon S3 bucket maintained for the course, although it appears similar to an F1 dataset available on Kaggle.

Project Objectives

  • Build and maintain a well-structured and documented record of work on a data science project, so that the work can be transferred and reproduced.
    • Maintain a well-structured GitHub repo with an informative landing page.
    • Comment code using established best practices.
    • Make commits using established best practices.
    • Track models built for the project.
  • Exercise proper data management on Amazon S3.
  • Understand and practice the different philosophies behind inferential and predictive data modelling.
  • Use data visualisations effectively.
  • Deal with missing data appropriately for a given task.

Inferential Task

The inferential task seeks to answer the question: "What factors explain why a driver finishes in second place in F1 races between 1950 and 2010?"

The question is approached using an informal theory of F1 strategy and performance, operationalised as a statistical model to understand which factors affect a driver's chance of arriving in second place, and how.
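As a rough illustration of how the outcome for such a model might be encoded, the sketch below derives a binary "finished second" indicator from race results. The field names (`race_id`, `driver_id`, `position`) and the sample rows are hypothetical and are not taken from the course dataset; the project's own operationalisation is described in Inferential.md.

```python
def add_second_place_label(results):
    """Return copies of the result rows with a binary `finished_second` field.

    `results` is a list of dicts; `position` is assumed (hypothetically) to be
    the finishing position, or None when the driver did not finish.
    """
    labelled = []
    for row in results:
        new_row = dict(row)  # copy so the input rows are left untouched
        new_row["finished_second"] = 1 if row.get("position") == 2 else 0
        labelled.append(new_row)
    return labelled


# Hypothetical sample rows, for illustration only.
sample_results = [
    {"race_id": 1, "driver_id": 10, "position": 1},
    {"race_id": 1, "driver_id": 11, "position": 2},
    {"race_id": 1, "driver_id": 12, "position": None},  # did not finish
]
labelled_results = add_second_place_label(sample_results)
```

Treating a missing position as "not second" is itself an assumption; a different handling of non-finishers may suit the analysis better.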

Prediction Task

The prediction task seeks to build a model that predicts which driver finishes in second place in races between 2011 and 2017, using data from 1950 to 2010.
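The time-based split described above could look something like the following sketch. It assumes each row carries a `year` field, which is a hypothetical name; the function and sample data are illustrative and not the project's actual code.

```python
def split_by_season(rows, first_train_year=1950, last_train_year=2010,
                    last_test_year=2017):
    """Split rows into training (1950-2010) and test (2011-2017) sets by year.

    Rows outside both ranges are dropped. `year` is an assumed field name.
    """
    train = [r for r in rows if first_train_year <= r["year"] <= last_train_year]
    test = [r for r in rows if last_train_year < r["year"] <= last_test_year]
    return train, test


# Hypothetical sample rows, for illustration only.
sample_rows = [{"year": y} for y in (1950, 1987, 2010, 2011, 2017, 2018)]
train_rows, test_rows = split_by_season(sample_rows)
```

Splitting on season rather than at random keeps every test race strictly later than every training race, which matches the way the task is framed.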

Project Progress

To Do

Done

General

  1. Create basic repo structure
  2. Populate README
  3. Draw up to do list
  4. Populate and maintain references

Prediction Task

  1. Choose and justify choice of model evaluation metric
  2. Split data into training and test data
  3. Run and track several models with different hyperparameters on the test set.
  4. Experiment with features that could help predict the target
    • Reuse features from inferential task
    • Add more features that could help
      • Wrangle data to provide these other features
    • Use feature extraction and feature selection techniques to generate other features to try
  5. Iterate over 3 and 4 to try and improve model performance
  6. Share model results
  7. Explain and discuss best performing models
    • Provide statistics of model performance
    • Provide measures of feature importance
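This README does not say which evaluation metric was chosen. As one possible candidate, the sketch below computes a per-race accuracy: for each race, the driver with the highest model score is taken as the predicted runner-up, and the metric is the share of races where that driver actually finished second. The field names (`race_id`, `score`, `finished_second`) and the sample scores are hypothetical.

```python
from collections import defaultdict


def per_race_accuracy(predictions):
    """Share of races where the top-scored driver actually finished second.

    `predictions` is a list of dicts with assumed fields `race_id`, `score`
    (model output, higher means more likely second) and `finished_second`
    (1 or 0).
    """
    by_race = defaultdict(list)
    for row in predictions:
        by_race[row["race_id"]].append(row)
    if not by_race:
        raise ValueError("no predictions supplied")

    hits = 0
    for rows in by_race.values():
        top_pick = max(rows, key=lambda r: r["score"])  # one pick per race
        hits += top_pick["finished_second"]
    return hits / len(by_race)


# Hypothetical scores for two races, for illustration only.
sample_predictions = [
    {"race_id": 1, "score": 0.2, "finished_second": 0},
    {"race_id": 1, "score": 0.7, "finished_second": 1},
    {"race_id": 1, "score": 0.1, "finished_second": 0},
    {"race_id": 2, "score": 0.6, "finished_second": 0},
    {"race_id": 2, "score": 0.3, "finished_second": 1},
]
accuracy = per_race_accuracy(sample_predictions)
```

A per-race metric of this kind sidesteps the class imbalance of row-level accuracy, since only one driver per race can finish second.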

Final

  1. Comparison of predictive model with inferential model

Inferential Task

  1. Develop and explain informal theory of F1 and testable hypotheses in Inferential.md
  2. Develop and explain statistical approach, and operationalisation of variables, for the inferential task in Inferential.md
  3. Wrangle data to provide variables of interest based on 2.
  4. Run analysis and present results
    • Overall model fit
    • Variable importance
    • Marginal effects of variables
  5. Discussion of statistical results in relation to proposed theory

Repo File Structure

project\
|
| -- src
|     |-- data            <- Code to read/munge raw data.
|     |-- features        <- Code to transform/append data.
|     |-- models          <- Code to analyze the data.
|     |-- visualizations  <- Code to generate visualizations.
|
| -- reports
|     |-- documents       <- Documents synthesizing the analysis.
|     |-- figures         <- Images generated by the code.
|
| -- References.md        <- Data dictionaries, explanatory materials.
|
| -- README.md            <- Project description.

About

Prediction and Inference Project on F1 Dataset using Python, PySpark, and R. Final Project for GR5069: Applied Data Science
