Formula 1, or F1, "is the highest class of single-seater auto racing sanctioned by the Fédération Internationale de l'Automobile (FIA)" (Source: Wikipedia). F1 is a global sport: races in the annual championship have been held on every continent except Antarctica.
In F1, each constructor builds two race cars and engages drivers to race them. Drivers first compete for a starting position in a qualifying round, with the fastest driver starting at the front. In the race itself, drivers compete for a higher finishing position by overtaking other cars. Each race runs a fixed number of laps that depends on the circuit length, typically between about 45 and 80.
This project is the final project submission for GR5069: Applied Data Science, offered by the Quantitative Methods in Social Science department at Columbia University. The project uses an F1 dataset to solve one inferential and one predictive problem. The full data were provided by the course instructors through an Amazon S3 bucket maintained for the course, although they appear similar to a publicly available dataset on Kaggle.
- Build and maintain a well-structured and documented record of work on a data science project to facilitate transferability and reproduction of work.
- Maintain a well-structured GitHub Repo with an informative landing page.
- Comment code using established best practices.
- Make commits using established best practices.
- Track models built for the project.
- Exercise proper data management on Amazon S3.
- Understand and practice the different philosophies behind inferential and predictive data modelling.
- Use data visualisations effectively.
- Deal with missing data appropriately for a given task.
The inferential task seeks to answer the question "what factors explain why a driver arrives in second place in F1 races between 1950 and 2010?"
The question is approached using an informal theory of F1 strategy and performance, operationalised as a statistical model to understand which factors affect a driver's chance of arriving in second place, and how.
The prediction task seeks to build a model that predicts which driver finishes in second place in races between 2011 and 2017, using data from 1950 to 2010 for training.
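The chronological split described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the record fields `year`, `driver`, and `position`, the flag name `second_place`, and the helpers `add_target` and `split_by_year` are all hypothetical, and the sample rows are toy values rather than rows from the course dataset.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def add_target(rows: List[Row]) -> List[Row]:
    """Return copies of the rows with a binary `second_place` flag (hypothetical name)."""
    return [{**row, "second_place": int(row["position"] == 2)} for row in rows]


def split_by_year(rows: List[Row], last_train_year: int = 2010) -> Tuple[List[Row], List[Row]]:
    """Split chronologically: train on seasons up to `last_train_year`, test on later ones."""
    train = [row for row in rows if row["year"] <= last_train_year]
    test = [row for row in rows if row["year"] > last_train_year]
    return train, test


# Toy rows for illustration only; they are not taken from the course dataset.
results = [
    {"year": 2009, "driver": "driver_a", "position": 1},
    {"year": 2009, "driver": "driver_b", "position": 2},
    {"year": 2012, "driver": "driver_a", "position": 2},
    {"year": 2012, "driver": "driver_b", "position": 5},
]

train_rows, test_rows = split_by_year(add_target(results))
print(len(train_rows), len(test_rows))
```

Splitting on the season year rather than at random keeps every test race strictly later than every training race, which mirrors how the model would be used on future seasons.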
- Create basic repo structure
- Populate README
- Draw up to-do list
- Populate and maintain references
1. Choose and justify choice of model evaluation metric
2. Split data into training and test data
3. Run and track several models with different hyperparameters on the test set.
4. Experiment with features that could help predict the target
   - Reuse features from inferential task
   - Add more features that could help
   - Wrangle data to provide these other features
   - Use feature extraction and feature selection techniques to generate other features to try
5. Iterate over 3 and 4 to try and improve model performance
6. Share model results
   - Explain and discuss best performing models
   - Provide statistics of model performance
   - Provide measures of feature importance
   - Comparison of predictive model with inferential model
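
As one illustration of the model performance statistics mentioned in the list above: only one driver per race finishes second, so the positive class is rare and plain accuracy can look good even for a model that never predicts second place. Precision, recall, and F1 for the positive class are one possible alternative. The sketch below is an assumption for illustration only; this README does not state which metric the project settled on, and the function name `second_place_metrics` and the toy labels are hypothetical.

```python
from typing import Dict, Sequence


def second_place_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute precision, recall and F1 for the positive class (1 = finished second)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly predicted second places
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # predicted second, finished elsewhere
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed second places
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy labels for illustration only: 1 = driver finished second, 0 = otherwise.
y_true = [0, 1, 0, 0, 1, 0]
y_pred = [0, 1, 1, 0, 0, 0]
report = second_place_metrics(y_true, y_pred)
print(report)
```
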
1. Develop and explain informal theory of F1 and testable hypotheses in Inferential.md
2. Develop and explain statistical approach, and operationalisation of variables, for the inferential task in Inferential.md
3. Wrangle data to provide variables of interest based on 2.
4. Run analysis and present results
   - Overall model fit
   - Variable importance
   - Marginal effects of variables
   - Discussion of statistical results in relation to proposed theory
project\
|
| -- src
| |-- data <- Code to read/munge raw data.
| |-- features <- Code to transform/append data.
| |-- models <- Code to analyze the data.
| |-- visualizations <- Code to generate visualizations.
|
| -- reports
| |-- documents <- Documents synthesizing the analysis.
| |-- figures <- Images generated by the code.
|
| -- References.md <- Data dictionaries, explanatory materials.
|
| -- README.md <- Project description.