68 | 68 | "id": "DauaqJ7WhIhO"
69 | 69 | },
70 | 70 | "source": [
71 | | - "This guide demonstrates how to use the [TensorFlow Core low-level APIs](https://www.tensorflow.org/guide/core) to perform [binary classification](https://developers.google.com/machine-learning/glossary#binary_classification){:.external} with [logistic regression](https://developers.google.com/machine-learning/crash-course/logistic-regression/){:.external}. It uses the [Wisconsin Breast Cancer Dataset](https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original)){:.external} for tumor classification.\n",
| 71 | + "This guide demonstrates how to use the [TensorFlow Core low-level APIs](https://www.tensorflow.org/guide/core) to perform [binary classification](https://developers.google.com/machine-learning/glossary#binary_classification) with [logistic regression](https://developers.google.com/machine-learning/crash-course/logistic-regression/). It uses the [Wisconsin Breast Cancer Dataset](https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original)) for tumor classification.\n",
72 | 72 | "\n",
73 | | - "[Logistic regression](https://developers.google.com/machine-learning/crash-course/logistic-regression/){:.external} is one of the most popular algorithms for binary classification. Given a set of examples with features, the goal of logistic regression is to output values between 0 and 1, which can be interpreted as the probabilities of each example belonging to a particular class. "
| 73 | + "[Logistic regression](https://developers.google.com/machine-learning/crash-course/logistic-regression/) is one of the most popular algorithms for binary classification. Given a set of examples with features, the goal of logistic regression is to output values between 0 and 1, which can be interpreted as the probabilities of each example belonging to a particular class. "
74 | 74 | ]
75 | 75 | },
76 | 76 | {

81 | 81 | "source": [
82 | 82 | "## Setup\n",
83 | 83 | "\n",
84 | | - "This tutorial uses [pandas](https://pandas.pydata.org){:.external} for reading a CSV file into a [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html){:.external}, [seaborn](https://seaborn.pydata.org){:.external} for plotting a pairwise relationship in a dataset, [Scikit-learn](https://scikit-learn.org/){:.external} for computing a confusion matrix, and [matplotlib](https://matplotlib.org/){:.external} for creating visualizations."
| 84 | + "This tutorial uses [pandas](https://pandas.pydata.org) for reading a CSV file into a [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html), [seaborn](https://seaborn.pydata.org) for plotting a pairwise relationship in a dataset, [Scikit-learn](https://scikit-learn.org/) for computing a confusion matrix, and [matplotlib](https://matplotlib.org/) for creating visualizations."
85 | 85 | ]
86 | 86 | },
87 | 87 | {
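The notebook's import cell is outside this diff's context. A minimal setup cell consistent with the libraries named above might look like this sketch:

```python
# Sketch of the setup implied by the text above; the notebook's actual
# import cell is not shown in this diff.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import sklearn.metrics as sk_metrics
import tensorflow as tf
```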
128 | 128 | "source": [
129 | 129 | "## Load the data\n",
130 | 130 | "\n",
131 | | - "Next, load the [Wisconsin Breast Cancer Dataset](https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original)){:.external} from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/){:.external}. This dataset contains various features such as a tumor's radius, texture, and concavity."
| 131 | + "Next, load the [Wisconsin Breast Cancer Dataset](https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original)) from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/). This dataset contains various features such as a tumor's radius, texture, and concavity."
132 | 132 | ]
133 | 133 | },
134 | 134 | {

156 | 156 | "id": "A3VR1aTP92nV"
157 | 157 | },
158 | 158 | "source": [
159 | | - "Read the dataset into a pandas [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html){:.external} using [`pandas.read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html){:.external}:"
| 159 | + "Read the dataset into a pandas [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) using [`pandas.read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html):"
160 | 160 | ]
161 | 161 | },
162 | 162 | {
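The code cell that performs the read is not shown in this hunk. As a hedged sketch: the `wdbc.data` file on the UCI server ships without a header row, so column names must be supplied; the URL and the naming scheme below are assumptions, not necessarily what the notebook uses.

```python
import pandas as pd

# Assumed URL of the raw data file; the file has no header row.
url = ('https://archive.ics.uci.edu/ml/machine-learning-databases/'
       'breast-cancer-wisconsin/wdbc.data')

# The 10 base tumor measurements; each appears as a mean, standard error
# (se), and largest value, giving 30 feature columns after 'id' and
# 'diagnosis'.
features = ['radius', 'texture', 'perimeter', 'area', 'smoothness',
            'compactness', 'concavity', 'concave_points', 'symmetry',
            'fractal_dimension']
column_names = ['id', 'diagnosis']
for stat in ['mean', 'se', 'largest']:
  column_names += [f'{f}_{stat}' for f in features]

dataset = pd.read_csv(url, names=column_names)
dataset.head()
```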
207 | 207 | "id": "s4-Wn2jzVC1W"
208 | 208 | },
209 | 209 | "source": [
210 | | - "Split the dataset into training and test sets using [`pandas.DataFrame.sample`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html){:.external}, [`pandas.DataFrame.drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html){:.external} and [`pandas.DataFrame.iloc`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html){:.external}. Make sure to split the features from the target labels. The test set is used to evaluate your model's generalizability to unseen data."
| 210 | + "Split the dataset into training and test sets using [`pandas.DataFrame.sample`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html), [`pandas.DataFrame.drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) and [`pandas.DataFrame.iloc`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html). Make sure to split the features from the target labels. The test set is used to evaluate your model's generalizability to unseen data."
211 | 211 | ]
212 | 212 | },
213 | 213 | {
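One plausible implementation of the split described above, continuing the DataFrame from the earlier sketch; the actual fraction and random seed are not shown in this diff:

```python
# Sample 75% of the rows for training; drop them to obtain the test set.
train_dataset = dataset.sample(frac=0.75, random_state=1)
test_dataset = dataset.drop(train_dataset.index)

# Separate features from target labels. Column 0 is assumed to be the
# non-predictive 'id' column and column 1 the 'diagnosis' target.
x_train, y_train = train_dataset.iloc[:, 2:], train_dataset['diagnosis']
x_test, y_test = test_dataset.iloc[:, 2:], test_dataset['diagnosis']
```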
277 | 277 | "\n",
278 | 278 | "This dataset contains the mean, standard error, and largest values for each of the 10 tumor measurements collected per example. The `\"diagnosis\"` target column is a categorical variable with `'M'` indicating a malignant tumor and `'B'` indicating a benign tumor diagnosis. This column needs to be converted into a numerical binary format for model training.\n",
279 | 279 | "\n",
280 | | - "The [`pandas.Series.map`](https://pandas.pydata.org/docs/reference/api/pandas.Series.map.html){:.external} function is useful for mapping binary values to the categories.\n",
| 280 | + "The [`pandas.Series.map`](https://pandas.pydata.org/docs/reference/api/pandas.Series.map.html) function is useful for mapping binary values to the categories.\n",
281 | 281 | "\n",
282 | 282 | "The dataset should also be converted to a tensor with the `tf.convert_to_tensor` function after the preprocessing is complete."
283 | 283 | ]
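A sketch of both steps, continuing the variable names assumed above:

```python
import tensorflow as tf

# Map the categorical labels to a numerical binary format:
# benign ('B') -> 0, malignant ('M') -> 1.
y_train, y_test = y_train.map({'B': 0, 'M': 1}), y_test.map({'B': 0, 'M': 1})

# Convert the preprocessed data to tensors once preprocessing is complete.
x_train, y_train = (tf.convert_to_tensor(x_train, dtype=tf.float32),
                    tf.convert_to_tensor(y_train, dtype=tf.float32))
x_test, y_test = (tf.convert_to_tensor(x_test, dtype=tf.float32),
                  tf.convert_to_tensor(y_test, dtype=tf.float32))
```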
301 | 301 | "id": "J4ubs136WLNp"
302 | 302 | },
303 | 303 | "source": [
304 | | - "Use [`seaborn.pairplot`](https://seaborn.pydata.org/generated/seaborn.pairplot.html){:.external} to review the joint distribution of a few pairs of mean-based features from the training set and observe how they relate to the target:"
| 304 | + "Use [`seaborn.pairplot`](https://seaborn.pydata.org/generated/seaborn.pairplot.html) to review the joint distribution of a few pairs of mean-based features from the training set and observe how they relate to the target:"
305 | 305 | ]
306 | 306 | },
307 | 307 | {
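The plotting cell itself is outside this hunk; a sketch of the call, assuming the training DataFrame and the mapped `diagnosis` column from the earlier sketches:

```python
import seaborn as sns

# Joint distributions of the first few mean-based features,
# colored by the 'diagnosis' target.
sns.pairplot(train_dataset.iloc[:, 1:6], hue='diagnosis', diag_kind='kde')
```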
343 | 343 | "id": "_8pDCIFjMla8"
344 | 344 | },
345 | 345 | "source": [
346 | | - "Given the inconsistent ranges, it is beneficial to standardize the data such that each feature has a zero mean and unit variance. This process is called [normalization](https://developers.google.com/machine-learning/glossary#normalization){:.external}."
| 346 | + "Given the inconsistent ranges, it is beneficial to standardize the data such that each feature has a zero mean and unit variance. This process is called [normalization](https://developers.google.com/machine-learning/glossary#normalization)."
347 | 347 | ]
348 | 348 | },
349 | 349 | {
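A minimal normalization helper built from this description, computing the statistics on the training set only so that no test-set information leaks in; a sketch, not necessarily the class the notebook defines:

```python
import tensorflow as tf

class Normalize(tf.Module):
  def __init__(self, x):
    # Per-feature mean and standard deviation from the training data.
    self.mean = tf.math.reduce_mean(x, axis=0)
    self.std = tf.math.reduce_std(x, axis=0)

  def norm(self, x):
    # Standardize to zero mean and unit variance.
    return (x - self.mean) / self.std

  def unnorm(self, x):
    # Invert the standardization.
    return (x * self.std) + self.mean

norm_x = Normalize(x_train)
x_train_norm, x_test_norm = norm_x.norm(x_train), norm_x.norm(x_test)
```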
384 | 384 | "\n",
385 | 385 | "### Logistic regression fundamentals\n",
386 | 386 | "\n",
387 | | - "Linear regression returns a linear combination of its inputs; this output is unbounded. The output of a [logistic regression](https://developers.google.com/machine-learning/glossary#logistic_regression){:.external} is in the `(0, 1)` range. For each example, it represents the probability that the example belongs to the _positive_ class.\n",
| 387 | + "Linear regression returns a linear combination of its inputs; this output is unbounded. The output of a [logistic regression](https://developers.google.com/machine-learning/glossary#logistic_regression) is in the `(0, 1)` range. For each example, it represents the probability that the example belongs to the _positive_ class.\n",
388 | 388 | "\n",
389 | 389 | "Logistic regression maps the continuous outputs of traditional linear regression, `(-∞, ∞)`, to probabilities, `(0, 1)`. This transformation is also symmetric so that flipping the sign of the linear output results in the inverse of the original probability.\n",
390 | 390 | "\n",
391 | | - "Let $Y$ denote the probability of being in class `1` (the tumor is malignant). The desired mapping can be achieved by interpreting the linear regression output as the [log odds](https://developers.google.com/machine-learning/glossary#log-odds){:.external} ratio of being in class `1` as opposed to class `0`:\n",
| 391 | + "Let $Y$ denote the probability of being in class `1` (the tumor is malignant). The desired mapping can be achieved by interpreting the linear regression output as the [log odds](https://developers.google.com/machine-learning/glossary#log-odds) ratio of being in class `1` as opposed to class `0`:\n",
392 | 392 | "\n",
393 | 393 | "$$\ln(\frac{Y}{1-Y}) = wX + b$$\n",
394 | 394 | "\n",
395 | 395 | "By setting $wX + b = z$, this equation can then be solved for $Y$:\n",
396 | 396 | "\n",
397 | 397 | "$$Y = \frac{e^{z}}{1 + e^{z}} = \frac{1}{1 + e^{-z}}$$\n",
398 | 398 | "\n",
399 | | - "The expression $\frac{1}{1 + e^{-z}}$ is known as the [sigmoid function](https://developers.google.com/machine-learning/glossary#sigmoid_function){:.external} $\sigma(z)$. Hence, the equation for logistic regression can be written as $Y = \sigma(wX + b)$.\n",
| 399 | + "The expression $\frac{1}{1 + e^{-z}}$ is known as the [sigmoid function](https://developers.google.com/machine-learning/glossary#sigmoid_function) $\sigma(z)$. Hence, the equation for logistic regression can be written as $Y = \sigma(wX + b)$.\n",
400 | 400 | "\n",
401 | 401 | "The dataset in this tutorial deals with a high-dimensional feature matrix. Therefore, the above equation must be rewritten in a matrix vector form as follows:\n",
402 | 402 | "\n",
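A quick numeric check of the mapping and the symmetry property mentioned above, using TensorFlow's built-in sigmoid:

```python
import tensorflow as tf

z = tf.constant([-5.0, -1.0, 0.0, 1.0, 5.0])
probs = tf.math.sigmoid(z)
print(probs)                        # all values fall in (0, 1); sigmoid(0) = 0.5
print(probs + tf.math.sigmoid(-z))  # symmetry: sigmoid(z) + sigmoid(-z) = 1
```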
437 | 437 | "source": [
438 | 438 | "### The log loss function\n",
439 | 439 | "\n",
440 | | - "The [log loss](https://developers.google.com/machine-learning/glossary#Log_Loss){:.external}, or binary cross-entropy loss, is the ideal loss function for a binary classification problem with logistic regression. For each example, the log loss quantifies the similarity between a predicted probability and the example's true value. It is determined by the following equation:\n",
| 440 | + "The [log loss](https://developers.google.com/machine-learning/glossary#Log_Loss), or binary cross-entropy loss, is the ideal loss function for a binary classification problem with logistic regression. For each example, the log loss quantifies the similarity between a predicted probability and the example's true value. It is determined by the following equation:\n",
441 | 441 | "\n",
442 | 442 | "$$L = -\frac{1}{m}\sum_{i=1}^{m}\left[y_i\cdot\log(\hat{y}_i) + (1 - y_i)\cdot\log(1 - \hat{y}_i)\right]$$\n",
443 | 443 | "\n",
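In code it is more numerically stable to compute this loss from the raw linear outputs $z$ (the logits) than from $\hat{y}$ itself. A sketch using TensorFlow's fused helper; the notebook's own definition may differ:

```python
import tensorflow as tf

def log_loss(y_pred, y):
  # y_pred is assumed to be the raw linear output z (logits), not a sigmoid
  # probability; the fused op avoids overflow in log(sigmoid(z)).
  ce = tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=y_pred)
  return tf.reduce_mean(ce)
```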
471 | 471 | "source": [
472 | 472 | "### The gradient descent update rule\n",
473 | 473 | "\n",
474 | | - "The TensorFlow Core APIs support automatic differentiation with `tf.GradientTape`. If you are curious about the mathematics behind the logistic regression [gradient updates](https://developers.google.com/machine-learning/glossary#gradient_descent){:.external}, here is a short explanation:\n",
| 474 | + "The TensorFlow Core APIs support automatic differentiation with `tf.GradientTape`. If you are curious about the mathematics behind the logistic regression [gradient updates](https://developers.google.com/machine-learning/glossary#gradient_descent), here is a short explanation:\n",
475 | 475 | "\n",
476 | 476 | "In the above equation for the log loss, recall that each $\hat{y}_i$ can be rewritten in terms of the inputs as $\sigma({\mathrm{X_i}}w + b)$.\n",
477 | 477 | "\n",
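Whatever the closed-form gradients look like, `tf.GradientTape` computes them automatically from the recorded forward pass. A minimal sketch of one update step, assuming a callable `model` exposing `variables` and the `log_loss` sketch above:

```python
import tensorflow as tf

learning_rate = 0.01

# Record the forward pass, then differentiate the loss with respect to
# the model's trainable variables and apply a plain gradient descent step.
with tf.GradientTape() as tape:
  y_pred = model(x_train_norm)  # hypothetical logistic regression module
  loss = log_loss(y_pred, y_train)
grads = tape.gradient(loss, model.variables)
for g, v in zip(grads, model.variables):
  v.assign_sub(learning_rate * g)
```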
754 | 754 | "\n",
755 | 755 | "For this problem, the FPR is the proportion of malignant tumor predictions among tumors that are actually benign. Conversely, the FNR is the proportion of benign tumor predictions among tumors that are actually malignant.\n",
756 | 756 | "\n",
757 | | - "Compute a confusion matrix using [`sklearn.metrics.confusion_matrix`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html#sklearn.metrics.confusion_matrix){:.external}, which evaluates the accuracy of the classification, and use matplotlib to display the matrix:"
| 757 | + "Compute a confusion matrix using [`sklearn.metrics.confusion_matrix`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html#sklearn.metrics.confusion_matrix), which evaluates the accuracy of the classification, and use matplotlib to display the matrix:"
758 | 758 | ]
759 | 759 | },
760 | 760 | {
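A sketch of this evaluation, assuming test-set probabilities in a tensor `y_pred_test` thresholded at 0.5 (the names here are illustrative, not the notebook's):

```python
import matplotlib.pyplot as plt
import sklearn.metrics as sk_metrics

# Threshold predicted probabilities into hard class labels (0 or 1).
y_pred_classes = (y_pred_test.numpy() > 0.5).astype(int)

confusion = sk_metrics.confusion_matrix(y_test.numpy(), y_pred_classes)

# Display the matrix with matplotlib.
plt.figure(figsize=(4, 4))
plt.imshow(confusion, cmap='Blues')
plt.colorbar()
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.title('Confusion matrix')
plt.show()
```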