Two more potential issues noted by Hossein Saiedian from KU. The first arises because we don't want to get into the splitting procedure in Classification I, but it could probably be fixed with a minor edit given some careful thought. The second I'll need to investigate further; it may just be a matter of updated software versions.
- In Section 5.3, the explanation of splitting data into training and test sets is great at setting expectations. However, in Section 5.6, the KNN classifier is trained on `cancer_train` (the full, filtered cancer dataset) without explicitly showing or mentioning a split. For beginners, it might be a bit confusing since they might wonder when the split occurred. Perhaps a quick note mentioning that the split is skipped here for simplicity could make things clearer.
- In Section 5.6, it's mentioned that `set_config(transform_output="pandas")` ensures scikit-learn outputs are pandas DataFrames. However, the output of `knn.predict(new_obs)` remains a NumPy array, not a DataFrame. This might be misleading, since `set_config` only affects transformer outputs, not predictions from estimators. I clarified with my students that this setting applies to preprocessing steps, not `.predict()`.

I've attached my classroom slides with some notes to help illustrate these points.
As usual we should sync with the R version if we change anything here.