
Potential issues in Classification I #361

Open
@trevorcampbell


Two more potential issues noted by Hossein Saiedian from KU. The first arises because we don't want to get into the splitting procedure in Classification I, but it could probably be fixed with a minor edit given some careful thought. The second I'll need to investigate further; it may just be a result of updated software versions.

  1. In Section 5.3, the explanation of splitting data into training and test sets is great at setting expectations. However, in Section 5.6, the KNN classifier is trained on cancer_train (the full, filtered cancer dataset) without explicitly showing or mentioning a split. For beginners, it might be a bit confusing since they might wonder when the split occurred. Perhaps a quick note mentioning that the split is skipped here for simplicity could make things clearer.
  2. In Section 5.6, it's mentioned that set_config(transform_output="pandas") ensures scikit-learn outputs are pandas DataFrames. However, the output of knn.predict(new_obs) remains a NumPy array, not a DataFrame. This might be misleading, since set_config only affects transformer outputs, not predictions from estimators. I clarified with my students that this setting applies to preprocessing steps, not .predict(). I've attached my classroom slides with some notes to help illustrate these points.

As usual, we should sync with the R version if we change anything here.


Labels: bug, needs-investigation
