
Commit a1236fe

simplified iris task
1 parent 0f78445 commit a1236fe

File tree: README.md, src/nn_iris.py, tests/test_nn_iris.py

3 files changed: +11 −95 lines

README.md

Lines changed: 9 additions & 25 deletions
@@ -1,6 +1,6 @@
# k-Nearest Neighbors Classification Exercise

-Today we will get to know the package `scikit-learn` (sklearn). It has many different machine learning algorithms already implemented, so we will be using it for the next five exercise sheets. The first algorithm, which we are going to learn today, is the k-nearest neighbor algorithm. It can be used for classification as well as for regression.
+Today we will get to know the package `scikit-learn` (sklearn). It has many different machine learning algorithms already implemented. The first algorithm, which we are going to learn today, is the k-nearest neighbor algorithm. It can be used for classification as well as for regression.

Take a look at the file `src/nn_iris.py`. We will implement the TODOs step by step:

@@ -13,25 +13,15 @@ Take a look at the file `src/nn_iris.py`. We will implement the TODOs step by step:
or directly via `pip install scikit-learn`.
The dataset <em>iris</em> is very popular amongst machine learners in example tasks. For this reason it can be found directly in the sklearn package.

-2. Navigate to the `__main__` function of `src/nn_iris.py` and load the iris dataset from `sklearn.datasets`.
+2. Navigate to the `__main__` function of `src/nn_iris.py`. At first we load the iris dataset from `sklearn.datasets`.
In the dataset there are several plants of different species of the genus Iris. For each of the examples the width and length of the petal and sepal of the flower were measured.
![A petal and a sepal of a flower (Wikipedia)](./figures/Petal_sepal.jpg)

3. Find out how to access the attributes of the dataset (Hint: set a breakpoint and examine the variable). Print the shape of the data matrix and the number of the target entries. Print the names of the labels. Print the names of the features.

-### Task 2: Examining the data (optional)
-
-Your goal is to determine the species for an example, based on the dimensions of its petals and sepals. But first we need to inspect the dataset.
-
-1. Use a histogram (class distribution) to check if the iris dataset is balanced. To plot a histogram you can for example use `pandas.Series.hist` or `matplotlib.pyplot.hist`.
-Fortunately, the iris dataset is balanced, so it has the same number of samples for each species. Balanced datasets make it simple to proceed directly to the classification phase. In the opposite case we would have to take additional steps to reduce the negative effects (e.g. collect more data) or use other algorithms than k-Nearest Neighbors (e.g. Random Forests).
-
-2. We can also use pandas' `scatter_matrix` to visualize some trends in our data. A scatter matrix (pairs plot) compactly plots all the numeric variables we have in a dataset against each other.
-Plot the scatter matrix. To make the different species visually distinguishable, use the parameter `c=iris.target` in `pandas.plotting.scatter_matrix` to colorize the datapoints according to their target species.
-In the scatter matrix you can see the domains of values as well as the distributions of each of the attributes. It is also possible to compare groups in scatter plots over all pairs of attributes. From those it seems that the groups are well separated; two of the groups slightly overlap.
-
-### Task 3: Training
+### Task 2: Training

+Your goal is to determine the species for an example, based on the dimensions of its petals and sepals.
First, we need to split the dataset into train and test data. Then we are ready to train the model.

1. Use `train_test_split` from `sklearn.model_selection` and create a train and a test set with the ratio 75:25. Print the dimensions of the train and the test set. You can use the parameter `random_state` to set the seed for the random number generator. That will make your results reproducible. Set this value to 29.
@@ -40,36 +30,30 @@ First, we need to split the dataset into train and test data. Then we are ready to train the model.

3. Train the classifier on the training set. The method `fit()` is present in all the estimators of the package `scikit-learn`.

-### Task 4: Prediction and Evaluation
+### Task 3: Prediction and Evaluation

The trained model is now able to receive the input data and produce predictions of the labels.
1. Predict the labels first for the train and then for the test data.

2. The comparison of a predicted and the true label can tell us valuable information about how well our model performs. The simplest performance measure is the ratio of correct predictions to all predictions, called accuracy. Implement a function `compute_accuracy` to calculate the accuracy of predictions. Use your function and evaluate your model by calculating the accuracy on the train set and the test set. Print both results.

-3. (Optional) To evaluate whether our model performs well, its performance is compared to other models. Since we now only know one classifier, we will compare it to dummy models. Most-frequent models always predict the label that occurs the most in our train set. If the train set is balanced, we choose one of the classes. Implement the function `accuracy_most_frequent` to compute the accuracy of the most frequent model. (Hint: the function `numpy.bincount` might be helpful.) Print the result.
-
-4. (Optional) Another dummy model is a stratified model. A stratified model assigns random labels based on the ratio of the labels in the train set. Implement the function `accuracy_stratified` to compute the accuracy of the stratified model. (Hint: `numpy.random.choice` might help.) Call the function several times and print the results. You will see that the results are different. In order to reproduce the results, it is useful to set a seed. Use `numpy.random.seed` before calling the function to set the seed. Set it to 29.
-
-### Task 5: Confusion matrix
+### Task 4: Confusion matrix

Another common method to evaluate the performance of a classifier is constructing a confusion matrix, which shows not only the accuracies for each of the classes (labels), but also which classes the classifier is most confused about.

1. Use the function `confusion_matrix` to compute the confusion matrix for the test set.

-2. (Optional) The accuracy of the prediction can be derived from the confusion matrix as the sum of the matrix diagonal over the sum of the whole matrix. Compute the accuracy using the information obtained from the confusion matrix. Print the result.
-
-3. We can also visualize the confusion matrix in the form of a heatmap. Use `ConfusionMatrixDisplay` to plot a heatmap of the confusion matrix for the test set. Use `display_labels=iris.target_names` for better visualization.
+2. We can also visualize the confusion matrix in the form of a heatmap. Use `ConfusionMatrixDisplay` to plot a heatmap of the confusion matrix for the test set. Use `display_labels=iris.target_names` for better visualization.

-### Task 6: Hyperparameter tuning
+### Task 5: Hyperparameter tuning

Now we need to find the best value for our hyperparameter `k`. We will use a common procedure called <em>grid search</em> to search the space of possible values. Since our train dataset is small, we will perform cross-validation in order to compute the validation error for each value of `k`. Implement this hyperparameter tuning in the function `cv_knearest_classifier` following these steps:

1. Define a second classifier `knn2`. Define a grid of parameter values for `k` from 1 to 25 (Hint: `numpy.arange`). This grid must be stored in a dictionary with `n_neighbors` as the key in order to use `GridSearchCV` with it.

2. Use the class `GridSearchCV` to perform the grid search. It gives you the possibility to perform n-fold cross-validation too, so use the attribute `cv` to set the number of folds to 3. When everything is set, you can train your `knn2`.

-### Task 7: Testing
+### Task 6: Testing

After the training you can access the best parameter `best_params_`, the corresponding validation accuracy `best_score_` and the corresponding estimator `best_estimator_`.
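
For orientation, the steps that remain in the simplified README (load, inspect, split, fit, predict, evaluate, tune) can be strung together roughly as follows. This is a minimal sketch using only the scikit-learn and numpy APIs named in the tasks; the variable names are illustrative and it is not the exercise's reference solution.

```python
# Rough sketch of the exercise workflow (illustrative, not the reference solution).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

iris = load_iris()
print(iris.data.shape, iris.target.shape)      # shape of the data matrix and number of targets
print(iris.target_names, iris.feature_names)   # label names and feature names

# Task 2: 75:25 train/test split with a fixed seed, as the README specifies.
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=29
)

# Tasks 2/3: fit a k-NN classifier and predict on train and test data.
knn = KNeighborsClassifier()
knn.fit(x_train, y_train)
y_pred_train = knn.predict(x_train)
y_pred_test = knn.predict(x_test)
print(np.mean(y_pred_train == y_train), np.mean(y_pred_test == y_test))  # accuracies

# Task 4: confusion matrix and its heatmap.
cm = confusion_matrix(y_test, y_pred_test)
ConfusionMatrixDisplay(cm, display_labels=iris.target_names).plot()
plt.show()

# Tasks 5/6: grid search over k = 1..25 with 3-fold cross-validation.
knn2 = KNeighborsClassifier()
param_grid = {"n_neighbors": np.arange(1, 26)}
grid = GridSearchCV(knn2, param_grid, cv=3)
grid.fit(x_train, y_train)
print(grid.best_params_, grid.best_score_, grid.best_estimator_)
```

The split seed (`random_state=29`), the `n_neighbors` grid 1..25 and `cv=3` are the values the README asks for; everything else uses scikit-learn defaults.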
src/nn_iris.py

Lines changed: 1 addition & 56 deletions
@@ -23,37 +23,6 @@ def compute_accuracy(y: np.ndarray, y_pred: np.ndarray) -> float:
    # TODO: Implement me.
    return None

-def accuracy_most_frequent(y_train: np.ndarray, y_test: np.ndarray) -> float:
-    """Compute the accuracy of the most frequent model.
-
-    Most frequent models always predict the label that occurs the most in the train set.
-    They belong to the class of so-called "dummy models", because the prediction is
-    independent of the input. Such models usually serve as a baseline if
-    no other models are available.
-
-    Args:
-        y_train (np.ndarray): The array with the training labels.
-        y_test (np.ndarray): The array with the test labels.
-    Returns:
-        float: The accuracy of the prediction by the most frequent model on the test labels.
-    """
-    # TODO: Implement me.
-    return None
-
-# optional
-def accuracy_stratified(y_train: np.ndarray, y_test: np.ndarray) -> float:
-    """Compute the accuracy of the stratified model.
-
-    A stratified model assigns random labels based on the ratio of the labels in the train set.
-
-    Args:
-        y_train (np.ndarray): The array with the training labels.
-        y_test (np.ndarray): The array with the test labels.
-    Returns:
-        float: The accuracy of the prediction by the stratified model on the test labels.
-    """
-    # TODO: Implement me.
-    return None

def cv_knearest_classifier(x_train: np.ndarray, y_train: np.ndarray) -> GridSearchCV:
    """Train and cross-validate a k-nearest neighbors classifier with the grid search.

@@ -80,26 +49,13 @@ def cv_knearest_classifier(x_train: np.ndarray, y_train: np.ndarray) -> GridSearchCV:

if __name__ == "__main__":
    # load iris dataset
-    iris = # TODO
+    iris = load_iris()

    # print shape of data matrix and number of target entries
    # TODO
    # print names of labels and of features
    # TODO

-    # (optional) use classes distribution (histogram) to check if iris dataset is balanced:
-    # find out what the next two lines of code do
-    temp = pd.Series(iris.target)
-    target_str = temp.apply(lambda i: iris.target_names[i])
-    # and use 'pandas.Series.hist' function to plot histogram
-    # TODO
-
-    # (optional) use pandas 'scatter_matrix' to visualize some trends in data:
-    # represent iris as pandas data frame
-    # TODO
-    # create scatter matrix from dataframe, color by target; plot matrix
-    # TODO
-
    # create train and test split with the ratio 75:25 and print their dimensions
    # TODO

@@ -117,20 +73,9 @@ def cv_knearest_classifier(x_train: np.ndarray, y_train: np.ndarray) -> GridSearchCV:
    # print both accuracies
    # TODO

-    # implement and use 'accuracy_most_frequent' to compute accuracy of most frequent model
-    # TODO
-    # print result
-    # TODO
-
-    # (optional) implement and use 'accuracy_stratified' to compute and print accuracy of the stratified model
-    # TODO
-
    # compute confusion matrix for test set
    # TODO

-    # (optional) compute and print test set accuracy from confusion matrix
-    # TODO
-
    # plot heatmap of confusion matrix for test set
    # TODO
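
After this commit, only `compute_accuracy` and `cv_knearest_classifier` remain to be implemented in `src/nn_iris.py`. A minimal sketch of how the two stubs could be filled in, assuming the signatures shown in the diff (again, not the reference solution):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier


def compute_accuracy(y: np.ndarray, y_pred: np.ndarray) -> float:
    """Return the fraction of predictions that match the true labels."""
    return float(np.mean(y == y_pred))


def cv_knearest_classifier(x_train: np.ndarray, y_train: np.ndarray) -> GridSearchCV:
    """Grid-search k from 1 to 25 with 3-fold cross-validation, as described in the README."""
    knn2 = KNeighborsClassifier()
    param_grid = {"n_neighbors": np.arange(1, 26)}
    grid_search = GridSearchCV(knn2, param_grid, cv=3)
    grid_search.fit(x_train, y_train)
    return grid_search
```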
tests/test_nn_iris.py

Lines changed: 1 addition & 14 deletions
@@ -8,7 +8,7 @@

sys.path.insert(0, "./src/")

-from src.nn_iris import compute_accuracy, accuracy_most_frequent, accuracy_stratified, cv_knearest_classifier
+from src.nn_iris import compute_accuracy, cv_knearest_classifier


def test_compute_accuracy():

@@ -17,19 +17,6 @@ def test_compute_accuracy():
    acc = compute_accuracy(y, y_pred)
    assert np.allclose(acc, 0.8)

-def test_accuracy_most_frequent():
-    ytrain = np.array([0, 1, 0, 1, 0, 1, 0, 1, 1])
-    ytest = np.array([1, 0, 1, 1, 1])
-    acc_mf = accuracy_most_frequent(ytrain, ytest)
-    assert np.allclose(acc_mf, 0.8)
-
-def test_accuracy_stratified():
-    np.random.seed(42)  # Set the random seed for reproducibility
-    ytrain = np.array([0, 1, 0, 1, 0, 1, 0, 1, 1])
-    ytest = np.array([1, 0, 1, 1, 1])
-    acc_strat = accuracy_stratified(ytrain, ytest)
-    assert np.allclose(acc_strat, 0.4)
-
def test_cv_knearest_classifier():
    # Create a dummy dataset
    X, y = make_classification(n_samples=100, n_features=20, random_state=42)
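
The body of the remaining `test_cv_knearest_classifier` is cut off in the view above. Purely for illustration, a test in this style could exercise the grid-search helper through the `GridSearchCV` attributes the README mentions (a hypothetical sketch, not the file's actual test):

```python
from sklearn.datasets import make_classification
from src.nn_iris import cv_knearest_classifier


def test_cv_knearest_classifier_sketch():
    # Dummy dataset, analogous to the one created in the real test.
    X, y = make_classification(n_samples=100, n_features=20, random_state=42)
    grid = cv_knearest_classifier(X, y)
    # The grid covered k = 1..25, so the selected value must lie in that range.
    assert 1 <= grid.best_params_["n_neighbors"] <= 25
    assert 0.0 <= grid.best_score_ <= 1.0
```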
