Complete Machine Learning & Data Science Bootcamp by Andrei Neagoie
If we want to review the whole curriculum in Jupyter notebooks,
check https://github.com/mrdbourke/zero-to-mastery-ml.git
super-well-organized!
- 11-prj-supervised-classification
- heart disease
- 12-prj-supervised-regression
- bulldozer price
- 14-prj-neural-networks-tensorflow
- dog breed classification
- 01-intro: (212. Google Colab Workspace - 230. Preparing Our Inputs and Outputs)
- 02-build-model: (232. Building A Deep Learning Model - )
- Supervised
- classification
- regression
- Unsupervised
- clustering
- association rule learning
- Reinforcement
- skill acquisition
- real time learning
- Create a framework
- Match to data science and machine learning tools
- Learn by doing
A 6 Step Field Guide for Building Machine Learning Projects
- Data collection
- Data modelling
- Problem definition
- What problem are we trying to solve?
- Data
- What data do we have?
- Evaluation
- What defines success?
- Features
- What features should we model?
- Modelling
- What kind of model should we use?
- Experimentation
- What have we tried / what else can we try?
- Deployment
- Supervised learning: "I know my inputs and outputs"
- classification
- binary classification: two options
- multi-class classification: more than two options
- regression
- predict numbers
- Unsupervised learning: "I'm not sure of the outputs but I have inputs"
- cluster
- Transfer learning: "I think my problem may be similar to something else"
- Reinforcement learning
- real-time learning: e.g. AlphaGo
- Will a simple hand-coded instruction based system work?
- Structured
- excel, csv, etc.
- Unstructured
- images?
- Static
- csv
- Streaming
| Classification | Regression | Recommendation |
|---|---|---|
| Accuracy | Mean Absolute Error (MAE) | Precision at K |
| Precision | Mean Squared Error (MSE) | |
| Recall | Root Mean Squared Error (RMSE) | |
- Numerical features
- Categorical features
Feature engineering: Looking at different features of data and creating new ones/altering existing ones
Feature Coverage: How many samples have different features? Ideally, every sample has the same features.
- Choosing and training a model - training data
- Tuning a model - validation data
- Model comparison - test data
- Training (course materials): e.g. 70-80%
- Validation (practice exam): e.g. 10-15%
- Test (final exam): e.g. 10-15%
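A minimal sketch of such a three-way split using NumPy index shuffling (the 100 samples and exact 70/15/15 cut are just illustrative):

```python
import numpy as np

# 100 toy samples; shuffle the indices so each split is random
rng = np.random.default_rng(42)
indices = rng.permutation(100)

train_idx = indices[:70]    # ~70% training (course materials)
val_idx = indices[70:85]    # ~15% validation (practice exam)
test_idx = indices[85:]     # ~15% test (final exam)

# Every sample lands in exactly one split
assert len(set(train_idx) | set(val_idx) | set(test_idx)) == 100
```

In practice `sklearn.model_selection.train_test_split` does the same job, but the idea is identical: each sample belongs to exactly one split, and the test set stays untouched.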
Generalization: The ability for a machine learning model to perform well on data it hasn't seen before
- Structured Data
- CatBoost
- Random Forest
- Unstructured Data
- Deep Learning
- Transfer Learning
Goal! Minimise time between experiments
- Underfitting
- Training: 64%, Test: 47%
- Balanced (Goldilocks zone)
- Training: 98%, Test: 96%
- Overfitting
- Training: 93%, Test: 99%
- Underfitting
- Try a more advanced model
- Increase model hyperparameters
- Reduce amount of features
- Train longer
- Overfitting
- Collect more data
- Try a less advanced model
- Want to avoid overfitting and underfitting (head towards generality)
- Keep the test set separate at all costs
- Compare apples to apples
- The best score on one performance metric does not equal the best model
- Overall
- Anaconda
- Jupyter
- Data analysis
- Pandas
- matplotlib
- NumPy
- Machine learning
- TensorFlow
- PyTorch
- Scikit-learn
- CatBoost
- dmlc/XGBoost
- Anaconda:
- miniconda:
- Conda: package manager
```shell
# Install miniconda
sh /Users/noah/Downloads/Miniconda3-latest-MacOSX-arm64.sh
# miniconda3 is installed in ~/miniconda3
# and it also adds the conda setup to my ~/.zshrc file
# (to uninstall, delete the conda setup in ~/.zshrc)

# Create a virtual environment (from the base environment)
conda create --prefix ./env pandas numpy matplotlib scikit-learn

# To activate this environment, use
#
# $ conda activate /Users/noah/Documents/study/study_codes/udemy/data-science-ml-andrei/data-science-ml-andrei-git/env
#
# To deactivate an active environment, use
#
# $ conda deactivate

conda install jupyter
jupyter notebook

# Export the environment
conda env export --prefix ./env > environment.yml

# Create an env from the env file
# conda env create --file environment.yml --name env_from_file
# would install env_from_file in ~/miniconda3/envs, so use --prefix instead
conda env create --file environment.yml --prefix ./env_from_file
```

`.ipynb` comes from IPython Notebook, the old name of the Jupyter notebook file format.
- Command mode: Escape
- Input mode: Enter
- m (in command mode): to Markdown
- y (in command mode): to Code
- a: insert cell above
- b: insert cell below
- d, d: delete cell
- Ctrl + Enter: Run Cells
- Shift + Enter: Run Cells and select below
- Opt + Enter: Run Cells and insert below
- Shift + Tab: display a hint
- It's fast
- Behind the scenes optimizations written in C
- Vectorization via broadcasting (avoiding loops)
- Backbone of other Python scientific packages
- a1
- Names: Array, vector
- 1-dimensional
- Shape = (1, 3)
- a2
- Names: Array, matrix
- More than 1-dimensional
- Shape = (2, 3)
- a3
- Names: Array, matrix
- More than 1-dimensional
- Shape = (3, 2, 3)
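A quick sketch of arrays like these (note that NumPy reports a true 1-dimensional array's shape as `(3,)` rather than `(1, 3)`; the values here are just examples):

```python
import numpy as np

a1 = np.array([1, 2, 3])        # vector
a2 = np.array([[1, 2, 3],
               [4, 5, 6]])      # matrix
a3 = np.ones((3, 2, 3))         # 3-dimensional array

print(a1.shape, a2.shape, a3.shape)  # (3,) (2, 3) (3, 2, 3)
print(a1.ndim, a2.ndim, a3.ndim)     # 1 2 3
```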
```python
%timeit sum(massive_array)     # Python's sum()
# 3.77 ms ± 120 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.sum(massive_array)  # NumPy's sum()
# 20.2 µs ± 94.8 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
```

NumPy has been optimized for numerical calculation, so it's much faster. When you can use a NumPy function, use it!
Standard deviation: a measure of how spread out a group of numbers is from the mean

```python
np.std(a2)
```

Variance: a measure of the average degree to which each number differs from the mean

- Higher variance = wider range of numbers
- Lower variance = narrower range of numbers

```python
np.var(a2)
```

| | Almond butter | Peanut butter | Cashew butter | Total ($) |
|---|---|---|---|---|
| Mon | 2 | 7 | 1 | 88 |
| Tues | 9 | 4 | 16 | 314 |
| Wed | 11 | 14 | 18 | 438 |
| Thurs | 13 | 13 | 16 | 426 |
| Fri | 15 | 18 | 9 | 402 |

| | Almond butter | Peanut butter | Cashew butter |
|---|---|---|---|
| Price | 10 | 8 | 12 |
Calculate Total ($) using the NumPy dot product
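One way to solve that exercise: the weekly sales matrix (5×3) dotted with the price vector (3,) gives each day's total.

```python
import numpy as np

# Jars sold per day (rows: Mon-Fri; columns: almond, peanut, cashew)
sales_amounts = np.array([[2, 7, 1],
                          [9, 4, 16],
                          [11, 14, 18],
                          [13, 13, 16],
                          [15, 18, 9]])

# Price per jar
prices = np.array([10, 8, 12])

totals = sales_amounts.dot(prices)
print(totals)  # [ 88 314 438 426 402] -- matches the Total ($) column
```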
A Visual Introduction to NumPy by Jay Alammar
- matplotlib lifecycle
- In general, try to use the Object-Oriented interface over the pyplot interface
Examples to create mock data

```python
import numpy as np
import matplotlib.pyplot as plt

# Create some data
x = np.linspace(0, 10, 100)
x[:10]

# Plot the data and create a line plot
fig, ax = plt.subplots()
ax.plot(x, x**2)

# Use the same data to create a scatter plot
fig, ax = plt.subplots()
ax.scatter(x, np.exp(x))

# Another scatter plot
fig, ax = plt.subplots()
ax.scatter(x, np.sin(x))

# A figure with two subplots, saved to a file
fig, (ax0, ax1) = plt.subplots(nrows=2,
                               ncols=1,
                               figsize=(10, 10))
fig.savefig("heart-disease-analysis-plot-saved-with-code.png")
```

- An end-to-end Scikit-Learn workflow
- Getting data ready (to be used with machine learning models)
- Choosing a machine learning model
- Fitting a model to the data (learning patterns)
- Making predictions with a model (using patterns)
- Evaluating model predictions
- Improving model predictions
- Saving and loading models
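The workflow steps above can be sketched end to end. This is a toy stand-in (random data, made-up labels), not the course's heart-disease dataset:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# 1. Get the data ready (toy data standing in for a real DataFrame)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 2. Choose a model
clf = RandomForestClassifier(random_state=42)

# 3. Fit the model to the data (learning patterns)
clf.fit(X_train, y_train)

# 4. Make predictions with the model (using patterns)
y_preds = clf.predict(X_test)

# 5. Evaluate the predictions
score = clf.score(X_test, y_test)
```

Improving and saving the model (steps 6 and 7) build on this same object: tune hyperparameters, then persist `clf` with `pickle` or `joblib`.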
```shell
conda activate ./env
conda list
conda update scikit-learn
conda list scikit-learn
conda search scikit-learn
conda search scikit-learn --info
# conda install python=3.6.9 scikit-learn=0.22
```

Clean Data (empty or missing data)
-> Transform Data (into something the computer understands)
-> Reduce Data (to manage resources)
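A minimal sketch of those three steps with pandas (the car columns and values are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Make": ["Honda", "BMW", None, "Toyota"],
                   "Odometer": [35000.0, None, 84000.0, 60000.0],
                   "Price": [15000, 22000, 9000, 12000]})

# Clean: fill missing values (median for numbers, a placeholder for categories)
df["Make"] = df["Make"].fillna("missing")
df["Odometer"] = df["Odometer"].fillna(df["Odometer"].median())

# Transform: one-hot encode categories so the computer understands them
df_encoded = pd.get_dummies(df, columns=["Make"])

# Reduce: drop columns you don't need to save resources, e.g.
# df_encoded = df_encoded.drop("SomeUnusedColumn", axis=1)
```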
```python
from sklearn.ensemble import RandomForestRegressor  # it can predict numbers

# Return the coefficient of determination of the prediction
R_squared = model.score(X_test, y_test)
```

RandomForestRegressor is based on what we call a Decision Tree algorithm.
```python
# evaluation 1
# Compare predictions to truth labels to evaluate the model
y_preds = clf.predict(X_test)
np.mean(y_preds == y_test)

# evaluation 2
clf.score(X_test, y_test)

# evaluation 3
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_preds)
```

```python
# predict_proba() returns probabilities of a classification label
clf.predict_proba(X_test[:5])
# array([[0.89, 0.11],
#        [0.49, 0.51],
#        [0.43, 0.57],
#        [0.84, 0.16],
#        [0.18, 0.82]])
```

Use this kind of scoring strategy to avoid getting a lucky score from a single train/test split:

```python
from sklearn.model_selection import cross_val_score
cross_val_score(clf, X, y, cv=10)
# array([0.90322581, 0.83870968, 0.87096774, 0.9       , 0.86666667,
#        0.8       , 0.76666667, 0.83333333, 0.73333333, 0.83333333])
```

- ROC and AUC, Clearly Explained! by StatQuest
- ROC documentation in Scikit-Learn (contains code examples)
- How the ROC curve and AUC are calculated by Google's Machine Learning team
```python
# How to install a conda package into the current environment from a Jupyter Notebook
import sys
%conda install seaborn --yes --prefix {sys.prefix}
```

```shell
# or install in terminal
conda install seaborn
```

3.3. Metrics and scoring: quantifying the quality of predictions (Scikit-Learn documentation)
Evaluating the results of a machine learning model is as important as building one.
But just like how different problems have different machine learning models, different machine learning models have different evaluation metrics.
Below are some of the most important evaluation metrics you'll want to look into for classification and regression models.
- Accuracy - The accuracy of the model in decimal form. Perfect accuracy is equal to 1.0.
- Precision - Indicates the proportion of positive identifications (model predicted class 1) which were actually correct. A model which produces no false positives has a precision of 1.0.
- Recall - Indicates the proportion of actual positives which were correctly classified. A model which produces no false negatives has a recall of 1.0.
- F1 score - A combination of precision and recall. A perfect model achieves an F1 score of 1.0.
- Confusion matrix - Compares the predicted values with the true values in a tabular way, if 100% correct, all values in the matrix will be top left to bottom right (diagonal line).
- Cross-validation - Splits your dataset into multiple parts, trains and tests your model on each part, then evaluates performance as an average.
- Classification report - Sklearn has a built-in function called `classification_report()` which returns some of the main classification metrics such as precision, recall and F1-score.
- ROC Curve - Also known as receiver operating characteristic, a plot of true positive rate versus false positive rate.
- Area Under Curve (AUC) Score - The area underneath the ROC curve. A perfect model achieves an AUC score of 1.0.
- Accuracy is a good measure to start with if all classes are balanced (e.g. same amount of samples which are labelled with 0 or 1).
- Precision and recall become more important when classes are imbalanced.
- If false-positive predictions are worse than false-negatives, aim for higher precision.
- If false-negative predictions are worse than false-positives, aim for higher recall.
- F1-score is a combination of precision and recall.
- A confusion matrix is always a good way to visualize how a classification model is going.
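The classification metrics above can be computed by hand on a toy example (the labels and predictions here are made up to show the formulas):

```python
import numpy as np

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives:  3
fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives: 1
fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives: 1

accuracy = np.mean(y_pred == y_true)                # 0.8
precision = tp / (tp + fp)                          # 0.75 (no false positives -> 1.0)
recall = tp / (tp + fn)                             # 0.75 (no false negatives -> 1.0)
f1 = 2 * precision * recall / (precision + recall)  # 0.75
```

`sklearn.metrics` (`accuracy_score`, `precision_score`, `recall_score`, `f1_score`) gives the same numbers without the manual counting.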
- R^2 (pronounced r-squared) or the coefficient of determination - Compares your model's predictions to the mean of the targets. Values can range from negative infinity (a very poor model) to 1. For example, if all your model does is predict the mean of the targets, its R^2 value would be 0. And if your model perfectly predicts a range of numbers, its R^2 value would be 1.
- Mean absolute error (MAE) - The average of the absolute differences between predictions and actual values. It gives you an idea of how wrong your predictions were.
- Mean squared error (MSE) - The average squared differences between predictions and actual values. Squaring the errors removes negative errors. It also amplifies outliers (samples which have larger errors).
- R2 is similar to accuracy. It gives you a quick indication of how well your model might be doing. Generally, the closer your R2 value is to 1.0, the better the model. But it doesn't really tell exactly how wrong your model is in terms of how far off each prediction is.
- MAE gives a better indication of how far off each of your model's predictions are on average.
- When choosing between MAE and MSE: because MSE squares the differences between predicted and actual values, it amplifies larger differences. Let's say we're predicting the value of houses (which we are).
- Pay more attention to MAE: When being $10,000 off is twice as bad as being $5,000 off.
- Pay more attention to MSE: When being $10,000 off is more than twice as bad as being $5,000 off.
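The same house-price intuition in a toy NumPy calculation (the prices are made up; `sklearn.metrics` has `mean_absolute_error`, `mean_squared_error` and `r2_score` for the same thing):

```python
import numpy as np

y_true = np.array([300_000, 250_000, 400_000])
y_pred = np.array([310_000, 245_000, 395_000])

# MAE: average dollars off per prediction
mae = np.mean(np.abs(y_pred - y_true))   # (10000 + 5000 + 5000) / 3 ~= 6666.67

# MSE: squaring amplifies the single $10,000 miss
mse = np.mean((y_pred - y_true) ** 2)    # 50,000,000

# R^2: 1 - (residual sum of squares / total sum of squares)
r2 = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)
```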
For more resources on evaluating a machine learning model, be sure to check out the following resources:
- Scikit-Learn documentation for metrics and scoring (quantifying the quality of predictions)
- Beyond Accuracy: Precision and Recall by Will Koehrsen
- Stack Overflow answer describing MSE (mean squared error) and RSME (root mean squared error)
Google it
Top 6 Machine Learning Algorithms for Classification
Bulldozer price prediction
- Train.csv is the training set, which contains data through the end of 2011.
- Valid.csv is the validation set, which contains data from January 1, 2012 - April 30, 2012 You make predictions on this set throughout the majority of the competition. Your score on this set is used to create the public leaderboard.
- Test.csv is the test set, which won't be released until the last week of the competition. It contains data from May 1, 2012 - November 2012. Your score on the test set determines your final rank for the competition.
- SalesID: the unique identifier of the sale
- MachineID: the unique identifier of a machine. A machine can be sold multiple times
- saleprice: what the machine sold for at auction (only provided in train.csv)
- saledate: the date of the sale
- Google Colab
- New notebook

```python
!unzip "drive/MyDrive/Colab Notebooks/data/dog-breed-identification.zip" -d "drive/MyDrive/Colab Notebooks/data/dog-vision"
```

Runtime -> Change runtime type -> Hardware accelerator: GPU
- Google Colab shortcut list: Cmd + M, H
- See the docstring: Shift + Cmd + Space
- Yann Lecun batch size
- Jeremy Howard batch size
- Review: MobileNetV2 — Light Weight Model (Image Classification)
- Convolutional Neural Networks — the ELI5 way
- Softmax function
| Binary classification | Multi-class classification | |
|---|---|---|
| Activation | sigmoid | softmax |
| Loss | Binary Cross Entropy | Categorical Cross Entropy |
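The two activation functions in the table can be sketched in plain NumPy (just the math, not the TensorFlow implementations):

```python
import numpy as np

def sigmoid(x):
    # Squashes a single logit into (0, 1) -- binary classification
    return 1 / (1 + np.exp(-x))

def softmax(x):
    # Turns a vector of logits into probabilities that sum to 1
    e = np.exp(x - np.max(x))  # subtract max for numerical stability
    return e / e.sum()

print(sigmoid(0.0))                        # 0.5
print(softmax(np.array([2.0, 1.0, 0.1])))  # three probabilities summing to 1.0
```

The matching loss then compares these probabilities against the true labels: binary cross entropy for the sigmoid output, categorical cross entropy for the softmax output.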
TensorFlow - Early Stopping Callback
- TensorFlow save and load
- To save in the HDF5 format with a .h5 extension
- TensorFlow save and serialize
Attempt to run it on local
- pyenv path in .zshrc

```shell
pyenv local 3.9.5
pip install poetry
python -m pip install --upgrade pip
python -m poetry install
```

- fail!

```
The currently activated Python version 3.9.15 is not supported by the project (3.9).
Trying to find and use a compatible version.
Poetry was unable to find a compatible version. If you have one, you can explicitly use it via the "env use" command.
```

```shell
poetry env use /Users/noah/.pyenv/versions/3.9.15/bin/python3.9
```

- fail!
- ah... after changing `python = "3.9"` to `python = "3.9.15"` in the `pyproject.toml` file, it works!
- However, eventually there's a TensorFlow install error on Mac
- Trying another model from TensorFlow Hub - Perhaps a different model would perform better on our dataset. One option would be to experiment with a different pre-trained model from TensorFlow Hub or look into the tf.keras.applications module.
- Data augmentation - Take the training images and manipulate (crop, resize) or distort them (flip, rotate) to create even more training data for the model to learn from. Check out the TensorFlow images documentation for a whole bunch of functions you can use on images. A great idea would be to try and replicate the techniques in this example cat vs. dog image classification notebook for our dog breeds problem.
- Fine-tuning - The model we used in this notebook was directly from TensorFlow Hub, we took what it had already learned from another dataset (ImageNet) and applied it to our own. Another option is to use what the model already knows and fine-tune this knowledge to our own dataset (pictures of dogs). This would mean all of the patterns within the model would be updated to be more specific to pictures of dogs rather than general images.
If you're after more, one of the best ways to find out something is to search for something like:
- "How to improve a TensorFlow 2.x image classification model?"
- "TensorFlow 2.x image classification best practices"
- "Transfer learning for image classification with TensorFlow 2.x"
- "Deep learning project examples with TensorFlow 2.x"
The TensorFlow developers have even put together a massive compilation of all of their favourite TensorFlow and machine learning resources.
When you see an example you think might be beyond your reach (because it looks too complicated), remember if in doubt, run the code. Try and reproduce what you see. This is the best way to get hands-on and build your own knowledge.
No one starts out knowing how to do every single thing. They just get better at knowing what to look for.
And remember, if you have any questions, don't forget to send @mrdbourke or @AndreiNeagoie a message on Twitter or in the Discord chat!
- How to Think About Communicating and Sharing Your Technical Work
- Basecamp’s guide to internal communication
- You Should Blog by Jeremy Howard from fast.ai
- Why you (yes, you) should blog by Rachel Thomas from fast.ai
As you see, you do not need to have a degree in Mathematics to be a great Data Scientist. If you have finished the course and you are looking to expand your knowledge, or you are simply curious, Daniel and I recommend the below free resources. In our opinion, they are the BEST resources for you to learn these topics and have fun along the way without falling asleep: