
pcsmomo/data-science-ml-andrei


Complete Machine Learning & Data Science Bootcamp by Andrei Neagoie

To review the whole curriculum as Jupyter notebooks,
see https://github.com/mrdbourke/zero-to-mastery-ml.git
It is very well organized!

Folder structure

  • 11-prj-supervised-classification
    • heart disease
  • 12-prj-supervised-regression
    • bulldozer price
  • 14-prj-neural-networks-tensorflow
    • dog breed classification
    • 01-intro: (212. Google Colab Workspace - 230. Preparing Our Inputs and Outputs)
    • 02-build-model: (232. Building A Deep Learning Model - )

Details


Section 2: Machine Learning 101

7. Exercise: Machine Learning Playground

Teachable machine with google

9. Exercise: YouTube Recommendation Engine

ML Playground

10. Types of Machine Learning

  • Supervised
    • classification
    • regression
  • Unsupervised
    • clustering
    • association rule learning
  • Reinforcement
    • skill acquisition
    • real time learning

Section 3: Machine Learning and Data Science Framework

16. Introducing Our Framework

  1. Create a framework
  2. Match to data science and machine learning tools
  3. Learn by doing

17. 6 Step Machine Learning Framework

A 6 Step Field Guide for Building Machine Learning Projects

6-step-ml-framework

  1. Data collection
  2. Data modelling
    1. Problem definition
      • What problem are we trying to solve?
    2. Data
      • What data do we have?
    3. Evaluation
      • What defines success?
    4. Features
      • What features should we model?
    5. Modelling
      • What kind of model should we use?
    6. Experimentation
      • What have we tried / what else can we try?
  3. Deployment

18. Types of Machine Learning Problems

  • Supervised learning: "I know my inputs and outputs"
    • classification
      • binary classification: two options
      • multi-class classification: more than two options
    • regression
      • predict numbers
  • Unsupervised learning: "I'm not sure of the outputs but I have inputs"
    • cluster
  • Transfer learning: "I think my problem may be similar to something else"
  • Reinforcement learning
    • real-time learning: e.g. AlphaGo

When shouldn't you use machine learning?

  • Will a simple hand-coded instruction based system work?

19. Types of Data

Structured/Unstructured

  • Structured
    • excel, csv, etc.
  • Unstructured
    • images?

Static/Streaming

  • Static
    • csv
  • Streaming

20. Types of Evaluation

| Classification | Regression | Recommendation |
| --- | --- | --- |
| Accuracy | Mean Absolute Error (MAE) | Precision at K |
| Precision | Mean Squared Error (MSE) | |
| Recall | Root Mean Squared Error (RMSE) | |

21. Features In Data

  • Numerical features
  • Categorical features

Feature engineering: Looking at different features of data and creating new ones/altering existing ones

What features should you use?

Feature Coverage: How many samples have a value for each feature? Ideally, every sample has the same features
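Feature coverage can be checked with pandas. A minimal sketch, assuming a made-up toy table (the column names and values below are hypothetical, not from the course data):

```python
import pandas as pd

# Hypothetical toy dataset; None marks a missing value.
df = pd.DataFrame({
    "weight": [70, 82, None, 65],
    "sex": ["M", "F", "F", None],
    "heart_rate": [60, None, None, None],
})

# Fraction of samples that have a value for each feature.
coverage = df.notna().mean()
print(coverage)
```

A feature with low coverage (here `heart_rate`) is a candidate for imputation or removal.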

22. Modelling - Splitting Data

3 parts to modelling

  1. Choosing and training a model - training data
  2. Tuning a model - validation data
  3. Model comparison - test data

The most important concept in machine learning: The 3 sets

  • Training (Course materials): e.g. 70-80%
  • Validation (Practice exam): e.g. 10-15%
  • Test (Final exam): e.g. 10-15%

Generalization: The ability for a machine learning model to perform well on data it hasn't seen before
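One way to get the three sets is to call scikit-learn's `train_test_split` twice. A sketch on synthetic data; the 75% / 12.5% / 12.5% proportions and the random seed are assumptions chosen for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 200 samples, 2 features, binary labels.
X = np.arange(400).reshape(200, 2)
y = np.arange(200) % 2

# First split: keep 75% for training, hold out 25%.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Second split: divide the held-out part evenly into validation and test.
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42
)

print(len(X_train), len(X_valid), len(X_test))
```

Tune on the validation set only; touch the test set once, at the very end.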

23. Modelling - Picking the Model

  • Structured Data
    • CatBoost
    • Random Forest
  • Unstructured Data
    • Deep Learning
    • Transfer Learning

Goal! Minimise time between experiments

25. Modelling - Comparison

  • Underfitting
    • Training: 64%, Test: 47%
  • Balanced (Goldilocks zone)
    • Training: 98%, Test: 96%
  • Overfitting
    • Training: 93%, Test: 99%

Fixes for overfitting and underfitting

  • Underfitting
    • Try a more advanced model
    • Increase model hyperparameters
    • Reduce amount of features
    • Train longer
  • Overfitting
    • Collect more data
    • Try a less advanced model

Things to remember

  • Want to avoid overfitting and underfitting (head towards generality)
  • Keep the test set separate at all costs
  • Compare apples to apples
  • One best performance metric does not equal best model

28. Tools We Will Use

  • Overall
    • Anaconda
    • Jupyter
  • Data analysis
    • Pandas
    • matplotlib
    • NumPy
  • Machine learning
    • TensorFlow
    • PyTorch
    • Scikit-learn
    • CatBoost
    • dmlc/XGBoost

29. Optional: Elements of AI

Elements of AI

Section 5: Data Science Environment Setup

35. What is Conda?

  • Anaconda:
  • miniconda:
  • Conda : package manager

37. Mac Environment Setup

# Install miniconda
sh /Users/noah/Downloads/Miniconda3-latest-MacOSX-arm64.sh
# miniconda3 is installed in ~/miniconda3
# and also it will add conda setup to my ~/.zshrc file

# Create a virtual environment
# run from the (base) prompt
conda create --prefix ./env pandas numpy matplotlib scikit-learn

# To activate this environment, use
#
#     $ conda activate /Users/noah/Documents/study/study_codes/udemy/data-science-ml-andrei/data-science-ml-andrei-git/env
#
# To deactivate an active environment, use
#
#     $ conda deactivate

To remove the (base) prefix from the prompt,
delete the conda setup block in ~/.zshrc

38. Mac Environment Setup 2

conda install jupyter
jupyter notebook

42. Sharing your Conda Environment

# export
conda env export --prefix ./env > environment.yml

# create env from the env file
# conda env create --file environment.yml --name env_from_file
# this would install env_from_file in ~/miniconda3/envs
conda env create --file environment.yml --prefix ./env_from_file

43. Jupyter Notebook Walkthrough

.ipynb stands for IPython Notebook, the old name of Jupyter Notebook; it is still the notebook file extension

44. Jupyter Notebook Walkthrough 2

Short-cuts

  • Command mode: Escape
  • Input mode: Enter
  • m (in command mode): to Markdown
  • y (in command mode): to Code
  • a: insert cell above
  • b: insert cell below
  • d, d: delete cell
  • Ctrl + Enter: Run Cells
  • Shift + Enter: Run Cells and select below
  • Opt + Enter: Run Cells and insert below
  • Shift + Tab: display a hint

Section 6: Pandas: Data Analysis

49. Series, Data Frames and CSVs

anatomy-of-a-dataframe

Section 7: NumPy

60. NumPy Introduction

  • It's fast
  • Behind the scenes optimizations written in C
  • Vectorization via broadcasting (avoiding loops)
  • Backbone of other Python scientific packages
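A small sketch of what "vectorization via broadcasting" means in practice. The arrays below are made-up values for illustration:

```python
import numpy as np

prices = np.array([10.0, 8.0, 12.0])

# Loop version: one Python-level multiplication per element.
doubled_loop = [p * 2 for p in prices]

# Vectorized version: the scalar 2 is broadcast across the whole array.
doubled_vec = prices * 2

# Broadcasting a (3,) array against a (2, 3) array:
# the prices row is applied to every row of quantities.
quantities = np.array([[2, 7, 1],
                       [9, 4, 16]])
revenue = quantities * prices

print(doubled_vec)
print(revenue)
```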

65. Viewing Arrays and Matrices

numpy.unique documentation

  • a1
    • Names: Array, vector
    • 1-dimensional
    • Shape = (3,)
  • a2
    • Names: Array, matrix
    • More than 1-dimensional
    • Shape = (2, 3)
  • a3
    • Names: Array, matrix
    • More than 1-dimensional
    • Shape = (3, 2, 3)
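A quick way to check these properties yourself. The values are arbitrary placeholders; only the shapes matter. Note that NumPy reports the shape of a 1-dimensional array of three elements as `(3,)`; reshaping it to `(1, 3)` makes it 2-dimensional:

```python
import numpy as np

a1 = np.array([1, 2, 3])                      # 1-dimensional (vector)
a2 = np.array([[1, 2.0, 3.3],
               [4, 5, 6.5]])                  # 2-dimensional (matrix)
a3 = np.arange(18).reshape(3, 2, 3)           # 3-dimensional

for name, arr in [("a1", a1), ("a2", a2), ("a3", a3)]:
    print(name, "ndim:", arr.ndim, "shape:", arr.shape)

# Reshaping the vector into a single-row matrix changes ndim from 1 to 2.
a1_row = a1.reshape(1, 3)
```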

66. Manipulating Arrays

Numpy Broadcast Rule

67. Manipulating Arrays 2

Aggregation

%timeit sum(massive_array) # Python's sum()
# 3.77 ms ± 120 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.sum(massive_array) # NumPy's sum()
# 20.2 µs ± 94.8 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

NumPy is optimized for numerical calculation, so np.sum() is much faster than Python's sum() on NumPy arrays.
When a NumPy function exists for the job, use it!

Standard deviation

a measure of how spread out a group of numbers is from the mean

np.std(a2)

Variance

a measure of the average degree to which each number is different from the mean

  • Higher variance = wider range of numbers
  • Lower variance = narrower range of numbers

np.var(a2)
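Standard deviation is the square root of variance. A short check on two made-up arrays (not the course's `a2`):

```python
import numpy as np

high_var = np.array([1, 100, 200, 300, 4000, 5000])
low_var = np.array([2, 4, 6, 8, 10])

# Variance: mean of squared distances from the mean.
print(np.var(high_var), np.var(low_var))

# Standard deviation: square root of the variance.
print(np.std(high_var), np.std(low_var))

# The same variance computed by hand for low_var.
manual_var = np.mean((low_var - low_var.mean()) ** 2)
```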

70. Dot Product vs Element Wise
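A minimal contrast between the two operations, using two small made-up matrices:

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[5, 6],
              [7, 8]])

# Element-wise: multiplies matching positions; shapes must match or broadcast.
elementwise = a * b

# Dot product: rows of a times columns of b;
# the inner dimensions must match: (2, 2) @ (2, 2) -> (2, 2).
dot = a @ b          # same as np.dot(a, b) for 2-D arrays

print(elementwise)
print(dot)
```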

71. Exercise: Nut Butter Store Sales

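| | Almond butter | Peanut butter | Cashew butter | Total ($) |
| --- | --- | --- | --- | --- |
| Mon | 2 | 7 | 1 | 88 |
| Tues | 9 | 4 | 16 | 314 |
| Wed | 11 | 14 | 18 | 438 |
| Thurs | 13 | 13 | 16 | 426 |
| Fri | 15 | 18 | 9 | 402 |

| | Almond butter | Peanut butter | Cashew butter |
| --- | --- | --- | --- |
| Price | 10 | 8 | 12 |

Calculate Total ($) using the NumPy dot product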
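One possible solution sketch, using the sales and price figures from the tables above (the variable names are my own):

```python
import numpy as np

# Rows: Mon..Fri, columns: almond, peanut, cashew butter jars sold.
sales = np.array([[2, 7, 1],
                  [9, 4, 16],
                  [11, 14, 18],
                  [13, 13, 16],
                  [15, 18, 9]])

# Price per jar, in the same column order.
prices = np.array([10, 8, 12])

# (5, 3) dot (3,) -> (5,): one total per day.
totals = sales.dot(prices)
print(totals)
```

The inner dimensions (3 and 3) must line up; if `prices` were stored as shape `(1, 3)`, it would need transposing first.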

72. Comparison Operators

NumPy Logic functions

77. Optional: Extra NumPy resources

A Visual Introduction to NumPy by Jay Alammar

Section 8: Matplotlib: Plotting and Data Visualization

80. Importing And Using Matplotlib

  • matplotlib lifecycle
  • In general, try to use the Object-Oriented interface over the pyplot interface
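A side-by-side sketch of the two interfaces. The `Agg` backend line is an assumption so the script also runs without a display; it can be dropped inside Jupyter:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)

# pyplot interface: functions act on an implicit "current" figure.
plt.figure()
plt.plot(x, x**2)
plt.title("pyplot interface")

# Object-oriented interface: explicit Figure and Axes objects.
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(x, x**2)
ax.set(title="Object-oriented interface", xlabel="x", ylabel="x squared")
```

With explicit `fig` and `ax` objects it stays clear which plot each call modifies, which matters once a figure has several subplots.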

81. Anatomy Of A Matplotlib Figure

anatomy-of-a-matplotlib-figure

82. Scatter Plot And Bar Plot

Examples to create mock data

# Create some data
x = np.linspace(0, 10, 100)
x[:10]

# Plot the data and create a line plot
fig, ax = plt.subplots()
ax.plot(x, x**2)

# Use same data to create a scatter plot
fig, ax = plt.subplots()
ax.scatter(x, np.exp(x))

# Another scatter plot
fig, ax = plt.subplots()
ax.scatter(x, np.sin(x))

86. Plotting From Pandas DataFrames

Pandas Chart Visualization

95. Customizing Your Plots 2

matplotlib colormap reference

96. Saving And Sharing Your Plots

fig, (ax0, ax1) = plt.subplots(nrows=2,
                               ncols=1,
                               figsize=(10, 10))
fig.savefig("heart-disease-analysis-plot-saved-with-code.png")

Section 9: Scikit-learn: Creating Machine Learning Models

99. Scikit-learn Introduction

Scikit Learn - User Guide

  • An end-to-end Scikit-Learn workflow
  • Getting data ready (to be used with machine learning models)
  • Choosing a machine learning model
  • Fitting a model to the data (learning patterns)
  • Making predictions with a model (using patterns)
  • Evaluating model predictions
  • Improving model predictions
  • Saving and loading models
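The workflow steps above can be sketched end to end as follows. This is an illustration on synthetic data from `make_classification`, not the course's heart-disease notebook; the model choice and parameters are assumptions:

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# 1. Get data ready (synthetic stand-in for a real dataset).
X, y = make_classification(n_samples=200, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# 2. Choose a model and 3. fit it to the training data.
clf = RandomForestClassifier(n_estimators=50, random_state=42)
clf.fit(X_train, y_train)

# 4. Make predictions and 5. evaluate them (mean accuracy here).
y_preds = clf.predict(X_test)
score = clf.score(X_test, y_test)
print(f"Test accuracy: {score:.2f}")

# 6. Improving would mean trying other hyperparameters, e.g. n_estimators.

# 7. Save and load the model (pickle round trip kept in memory here).
saved = pickle.dumps(clf)
loaded = pickle.loads(saved)
```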

105. Optional: Debugging Warnings In Jupyter

conda activate ./env
conda list
conda update scikit-learn
conda list scikit-learn
conda search scikit-learn
conda search scikit-learn --info
# conda install python=3.6.9 scikit-learn=0.22

107. Quick Tip: Clean, Transform, Reduce

Clean Data (empty or missing data)
-> Transform Data(computer understands)
-> Reduce Data(resource manage)
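A sketch of the three steps with pandas, assuming a made-up table (the column names and values are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "make": ["Toyota", "Honda", None, "BMW"],
    "odometer_km": [35000.0, None, 60000.0, 11000.0],
    "notes": ["", "", "", ""],   # carries no information
})

# Clean: fill missing values.
clean = df.copy()
clean["make"] = clean["make"].fillna("missing")
clean["odometer_km"] = clean["odometer_km"].fillna(clean["odometer_km"].mean())

# Transform: turn the categorical column into numbers (one-hot encoding).
transformed = pd.get_dummies(clean, columns=["make"])

# Reduce: drop columns the model does not need.
reduced = transformed.drop(columns=["notes"])
print(reduced)
```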

108. Getting Your Data Ready: Convert Data To Numbers

from sklearn.ensemble import RandomForestRegressor  # predicts numbers (regression)

114. NEW: Choosing The Right Model For Your Data

R_squared = model.score(X_test, y_test) # Return the coefficient of determination of the prediction.

115. NEW: Choosing The Right Model For Your Data 2 (Regression)

Random Forest Explanation

116. Quick Note: Decision Trees

RandomForestRegressor is based on what we call a Decision Tree algorithm.

120. Making Predictions With Our Model

# evaluation 1
# Compare predictions to truth labels to evaluate the model
y_preds = clf.predict(X_test)
np.mean(y_preds == y_test)

# evaluation 2
clf.score(X_test, y_test)

# evaluation 3
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_preds)

121. predict() vs predict_proba()

# predict_proba() returns probabilities of a classification label
clf.predict_proba(X_test[:5])
# array([[0.89, 0.11],
#        [0.49, 0.51],
#        [0.43, 0.57],
#        [0.84, 0.16],
#        [0.18, 0.82]])
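How the two methods relate, shown on synthetic data (the dataset and model settings are assumptions for illustration). For this classifier, `predict()` returns the class with the highest probability from `predict_proba()`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

probs = clf.predict_proba(X[:5])   # one row per sample, one column per class
labels = clf.predict(X[:5])        # hard class labels

# Picking the most probable class reproduces predict().
labels_from_probs = clf.classes_[np.argmax(probs, axis=1)]

print(probs)
print(labels)
```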

123. NEW: Evaluating A Machine Learning Model (Score) Part 1

sklearn regression evaluation

125. Evaluating A Machine Learning Model 2 (Cross Validation)

Use this scoring strategy to avoid relying on a single lucky (or unlucky) train/test split

from sklearn.model_selection import cross_val_score
cross_val_score(clf, X, y, cv=10)  # cv=10 -> ten folds, ten scores
# array([0.90322581, 0.83870968, 0.87096774, 0.9       , 0.86666667,
#        0.8       , 0.76666667, 0.83333333, 0.73333333, 0.83333333])

cross-validation
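A self-contained sketch on synthetic data (dataset and model settings are assumptions). The `cv` argument sets the number of folds, so the result holds one score per fold; the mean is the usual summary:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=8, random_state=42)
clf = RandomForestClassifier(n_estimators=50, random_state=42)

# 5-fold cross-validation: train on 4 folds, score on the 5th, repeat 5 times.
scores = cross_val_score(clf, X, y, cv=5)
mean_score = np.mean(scores)

print(scores)
print(f"Cross-validated accuracy: {mean_score:.2f}")
```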

  • ROC and AUC, Clearly Explained! by StatQuest
  • ROC documentation in Scikit-Learn (contains code examples)
  • How the ROC curve and AUC are calculated by Google's Machine Learning team

130. Evaluating A Classification Model 4 (Confusion Matrix)

Install seaborn package in Jupyter notebook

# How to install a conda package into the current environment from a Jupyter Notebook
import sys
%conda install seaborn --yes --prefix {sys.prefix}
# or install in terminal
conda install seaborn

132. Evaluating A Classification Model 6 (Classification Report)

3.3. Metrics and scoring: quantifying the quality of predictions

136. Machine Learning Model Evaluation

Evaluating the results of a machine learning model is as important as building one.

But just like how different problems have different machine learning models, different machine learning models have different evaluation metrics.

Below are some of the most important evaluation metrics you'll want to look into for classification and regression models.

Classification Model Evaluation Metrics/Techniques

  • Accuracy - The accuracy of the model in decimal form. Perfect accuracy is equal to 1.0.
  • Precision - Indicates the proportion of positive identifications (model predicted class 1) which were actually correct. A model which produces no false positives has a precision of 1.0.
  • Recall - Indicates the proportion of actual positives which were correctly classified. A model which produces no false negatives has a recall of 1.0.
  • F1 score - A combination of precision and recall. A perfect model achieves an F1 score of 1.0.
  • Confusion matrix - Compares the predicted values with the true values in a tabular way, if 100% correct, all values in the matrix will be top left to bottom right (diagonal line).
  • Cross-validation - Splits your dataset into multiple parts, trains and tests your model on each part, then evaluates performance as an average.
  • Classification report - Sklearn has a built-in function called classification_report() which returns some of the main classification metrics such as precision, recall and f1-score.
  • ROC Curve - Also known as receiver operating characteristic is a plot of true positive rate versus false-positive rate.
  • Area Under Curve (AUC) Score - The area underneath the ROC curve. A perfect model achieves an AUC score of 1.0.

Which classification metric should you use?

  • Accuracy is a good measure to start with if all classes are balanced (e.g. same amount of samples which are labelled with 0 or 1).
  • Precision and recall become more important when classes are imbalanced.
  • If false-positive predictions are worse than false-negatives, aim for higher precision.
  • If false-negative predictions are worse than false-positives, aim for higher recall.
  • F1-score is a combination of precision and recall.
  • A confusion matrix is always a good way to visualize how a classification model is going.
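The metrics above computed with `sklearn.metrics` on a small hand-made example (the labels are invented so the counts are easy to verify: 3 true positives, 1 false negative, 1 false positive, 5 true negatives):

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

accuracy = accuracy_score(y_true, y_pred)      # (TP + TN) / all
precision = precision_score(y_true, y_pred)    # TP / (TP + FP)
recall = recall_score(y_true, y_pred)          # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                  # harmonic mean of the two

# Rows are true labels, columns are predicted labels: [[TN, FP], [FN, TP]].
matrix = confusion_matrix(y_true, y_pred)

print(accuracy, precision, recall, f1)
print(matrix)
```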

Regression Model Evaluation Metrics/Techniques

  • R^2 (pronounced r-squared) or the coefficient of determination - Compares your model's predictions to the mean of the targets. Values can range from negative infinity (a very poor model) to 1. For example, if all your model does is predict the mean of the targets, its R^2 value would be 0. And if your model perfectly predicts a range of numbers, its R^2 value would be 1.
  • Mean absolute error (MAE) - The average of the absolute differences between predictions and actual values. It gives you an idea of how wrong your predictions were.
  • Mean squared error (MSE) - The average squared differences between predictions and actual values. Squaring the errors removes negative errors. It also amplifies outliers (samples which have larger errors).

Which regression metric should you use?

  • R^2 is similar to accuracy. It gives you a quick indication of how well your model might be doing. Generally, the closer your R^2 value is to 1.0, the better the model. But it doesn't really tell exactly how wrong your model is in terms of how far off each prediction is.
  • MAE gives a better indication of how far off each of your model's predictions are on average.
  • As for MAE or MSE, because of the way MSE is calculated, squaring the differences between predicted values and actual values, it amplifies larger differences. Let's say we're predicting the value of houses (which we are).
    • Pay more attention to MAE: When being $10,000 off is twice as bad as being $5,000 off.
    • Pay more attention to MSE: When being $10,000 off is more than twice as bad as being $5,000 off.
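A sketch of the MAE vs MSE difference with invented house prices. Both sets of predictions have the same MAE, but the one with a single large miss has a higher MSE:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([200_000, 300_000, 400_000])

# Three small misses of $5,000 vs one large miss of $15,000.
small_errors = y_true + np.array([5_000, -5_000, 5_000])
one_big_error = y_true + np.array([0, 0, 15_000])

mae_small = mean_absolute_error(y_true, small_errors)
mae_big = mean_absolute_error(y_true, one_big_error)
mse_small = mean_squared_error(y_true, small_errors)
mse_big = mean_squared_error(y_true, one_big_error)

# R^2 reference points: predicting the mean gives 0, perfect predictions give 1.
r2_mean = r2_score(y_true, np.full(len(y_true), y_true.mean()))
r2_perfect = r2_score(y_true, y_true)

print(mae_small, mae_big)
print(mse_small, mse_big)
print(r2_mean, r2_perfect)
```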

For more resources on evaluating a machine learning model, be sure to check out the following resources:

162. Choosing The Right Models

Google it
Top 6 Machine Learning Algorithms for Classification

Section 12: Milestone Project 2: Supervised Learning (Time Series Data)

175. Project Overview

Bulldozer price prediction

Blue Book for Bulldozers

177. Project Environment Setup

Data Description - Kaggle

The data for this competition is split into three parts:

  • Train.csv is the training set, which contains data through the end of 2011.
  • Valid.csv is the validation set, which contains data from January 1, 2012 - April 30, 2012. You make predictions on this set throughout the majority of the competition. Your score on this set is used to create the public leaderboard.
  • Test.csv is the test set, which won't be released until the last week of the competition. It contains data from May 1, 2012 - November 2012. Your score on the test set determines your final rank for the competition.

The key fields in train.csv are:

  • SalesID: the unique identifier of the sale
  • MachineID: the unique identifier of a machine. A machine can be sold multiple times
  • saleprice: what the machine sold for at auction (only provided in train.csv)
  • saledate: the date of the sale
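Because this is time series data, a common first step is to parse `saledate` as a date and sort by it. A sketch using a few invented rows with the field names listed above (the `saleYear` and `saleMonth` column names are my own choice, and the real CSV's date format may differ):

```python
from io import StringIO

import pandas as pd

# Hypothetical rows standing in for train.csv.
csv_text = StringIO(
    "SalesID,MachineID,saleprice,saledate\n"
    "1,101,66000,2006-11-16\n"
    "2,102,57000,2004-03-26\n"
    "3,101,10000,2004-02-26\n"
)

# parse_dates turns the column into datetime64 instead of plain strings.
df = pd.read_csv(csv_text, parse_dates=["saledate"])

# Sort chronologically so later date-based splits are straightforward.
df = df.sort_values("saledate").reset_index(drop=True)

# Derive numeric features from the date.
df["saleYear"] = df["saledate"].dt.year
df["saleMonth"] = df["saledate"].dt.month
print(df)
```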

Section 14: Neural Networks: Deep Learning, Transfer Learning and TensorFlow 2

211. Setting Up Google Colab

Let's start!

212. Google Colab Workspace

Google Colab FAQ

!unzip "drive/MyDrive/Colab Notebooks/data/dog-breed-identification.zip" -d "drive/MyDrive/Colab Notebooks/data/dog-vision"

218. Using A GPU

Runtime -> Change runtime type -> Hardware accelerator: GPU

221. Loading Our Data Labels

  • Google Colab shortcut list: Cmd + M, H
  • See the docstring: Shift + Cmd + Space

225. Preprocess Images

227. Turning Data Into Batches

231. Optional: How machines learn and what's going on behind the scenes?

232. Building A Deep Learning Model

233. Building A Deep Learning Model 2

234. Building A Deep Learning Model 3

Layers

| | Binary classification | Multi-class classification |
| --- | --- | --- |
| Activation | sigmoid | softmax |
| Loss | Binary Cross Entropy | Categorical Cross Entropy |
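What the activations and losses in the table compute, written in plain NumPy as an illustrative sketch. These helper functions are my own and are not the TensorFlow implementations used in the course:

```python
import numpy as np

def sigmoid(z):
    """Squash a score into (0, 1): probability of the positive class."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Turn a row of scores into probabilities that sum to 1."""
    shifted = z - z.max(axis=-1, keepdims=True)   # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def binary_cross_entropy(y_true, p):
    """Mean loss for 0/1 labels and predicted positive-class probabilities."""
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def categorical_cross_entropy(y_onehot, probs):
    """Mean loss for one-hot labels and per-class probabilities."""
    return -np.mean(np.sum(y_onehot * np.log(probs), axis=-1))

logits = np.array([[2.0, 1.0, 0.1]])
probs = softmax(logits)
loss = categorical_cross_entropy(np.array([[1.0, 0.0, 0.0]]), probs)
print(probs, loss)
```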

237. Evaluating Our Model

TensorFlow TensorBoard

238. Preventing Overfitting

TensorFlow - Early Stopping Callback

240. Evaluating Performance With TensorBoard

first trained model

242. Transform Predictions To Text

TensorFlow unbatch

250. Making Predictions On Our Images

Attempt to run it locally

Set up for poetry with pyenv

  • pyenv path .zshrc
  • pyenv local 3.9.15
  • pip install poetry
  • python -m pip install --upgrade pip
  • python -m poetry install
    • fail!

      The currently activated Python version 3.9.15 is not supported by the project (3.9).
      Trying to find and use a compatible version.

      Poetry was unable to find a compatible version. If you have one, you can explicitly use it via the "env use" command.

  • poetry env use /Users/noah/.pyenv/versions/3.9.15/bin/python3.9
    • ah... after changing python = "3.9" to python = "3.9.15" in the pyproject.toml file, it works!
  • However, eventually there's a TensorFlow install error on Mac

251. Finishing Dog Vision: Where to next?

  1. Trying another model from TensorFlow Hub - Perhaps a different model would perform better on our dataset. One option would be to experiment with a different pre-trained model from TensorFlow Hub or look into the tf.keras.applications module.
  2. Data augmentation - Take the training images and manipulate (crop, resize) or distort them (flip, rotate) to create even more training data for the model to learn from. Check out the TensorFlow images documentation for a whole bunch of functions you can use on images. A great idea would be to try and replicate the techniques in this example cat vs. dog image classification notebook for our dog breeds problem.
  3. Fine-tuning - The model we used in this notebook was directly from TensorFlow Hub, we took what it had already learned from another dataset (ImageNet) and applied it to our own. Another option is to use what the model already knows and fine-tune this knowledge to our own dataset (pictures of dogs). This would mean all of the patterns within the model would be updated to be more specific to pictures of dogs rather than general images.

If you're after more, one of the best ways to find out something is to search for something like:

  • "How to improve a TensorFlow 2.x image classification model?"
  • "TensorFlow 2.x image classification best practices"
  • "Transfer learning for image classification with TensorFlow 2.x"
  • "Deep learning project examples with TensorFlow 2.x"

The TensorFlow developers have even put together a massive compilation of all of their favourite TensorFlow and machine learning resources.

When you see an example you think might be beyond your reach (because it looks too complicated), remember if in doubt, run the code. Try and reproduce what you see. This is the best way to get hands-on and build your own knowledge.

No one starts out knowing how to do every single thing. They just get better at knowing what to look for.

And remember, if you have any questions, don't forget to send @mrdbourke or @AndreiNeagoie a message on Twitter or in the Discord chat!

Section 15: Storytelling + Communication: How To Present Your Work

259. Communicating and sharing your work: Further reading

Section 19: Extra: Learn Advanced Statistics and Mathematics for FREE!

374. Statistics and Mathematics

As you see, you do not need to have a degree in Mathematics to be a great Data Scientist. If you have finished the course and you are looking to expand your knowledge, or you are simply curious, Daniel and I recommend the below free resources. In our opinion, they are the BEST resources for you to learn these topics and have fun along the way without falling asleep:
