Distributed Machine Learning with PySpark

This project takes a CSV of NBA data with 128,069 rows and several features, including:

Player that took the shot
Distance from the rim
Whether the player was being guarded, and by whom
Team data
Among other features

The goal of this project is to use PySpark and Spark ML to build a classifier that returns the probability that a shot is made, given a set of input features.

As the notebook DataViz.ipynb shows (and as one might expect), one of the most predictive features is the distance from the rim. In that notebook, I have included the distribution of the response as a function of that feature.

The following Spark modules and helpers are used to build the classifiers and a prediction pipeline:

from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.classification import GBTClassifier
from pyspark.ml import Pipeline

Cross-validation error estimates and a parameter grid for hyper-parameter optimization come from the following modules:

from pyspark.ml.tuning import CrossValidator
from pyspark.ml.tuning import ParamGridBuilder

The classifier achieved an area under the ROC curve (AUC) of 0.63.
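To make the metric concrete: the AUC equals the probability that a randomly chosen made shot receives a higher predicted score than a randomly chosen missed shot, with ties counted as half. The pure-Python sketch below computes it by pairwise comparison on made-up labels and scores; the function name `roc_auc` is hypothetical and this is not how Spark computes the metric internally.

```python
def roc_auc(labels, scores):
    """Pairwise AUC: fraction of (made, missed) pairs ranked correctly, ties count 0.5."""
    made = [s for y, s in zip(labels, scores) if y == 1]
    missed = [s for y, s in zip(labels, scores) if y == 0]
    if not made or not missed:
        raise ValueError("AUC needs at least one made and one missed shot")

    wins = 0.0
    for m in made:
        for n in missed:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(made) * len(missed))


# Made-up example: 3 made shots, 3 missed shots.
example_labels = [1, 1, 0, 0, 1, 0]
example_scores = [0.9, 0.4, 0.5, 0.2, 0.7, 0.4]
example_auc = roc_auc(example_labels, example_scores)  # 7.5 correct pairs out of 9
```

Under this reading, 0.5 corresponds to random ranking and 1.0 to perfect ranking, so 0.63 indicates a modest but real signal.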

Repository files:

DataViz.ipynb
ML_Code.ipynb
NBA_Int2.csv
README.md
final_nba.csv

jpoberhauser/dist_comp_final

Folders and files

Latest commit

History

Repository files navigation

Distributed Machine Learning with PySpark

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages