</div>
</div>
## ANOVASelector

`ANOVASelector` operates on categorical labels with continuous features. It uses the
[one-way ANOVA F-test](https://en.wikipedia.org/wiki/F-test#Multiple-comparison_ANOVA_problems) to decide which
features to choose.
It supports five selection methods: `numTopFeatures`, `percentile`, `fpr`, `fdr`, `fwe`:
* `numTopFeatures` chooses a fixed number of top features according to the ANOVA F-test.
* `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number.
* `fpr` chooses all features whose p-values are below a threshold, thus controlling the false positive rate of selection.
* `fdr` uses the [Benjamini-Hochberg procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure) to choose all features whose false discovery rate is below a threshold.
* `fwe` chooses all features whose p-values are below a threshold. The threshold is scaled by 1/numFeatures, thus controlling the family-wise error rate of selection.

By default, the selection method is `numTopFeatures`, with the default number of top features set to 50.
The user can choose a selection method using `setSelectorType`.

**Examples**

Assume that we have a DataFrame with the columns `id`, `features`, and `label`, which is used as
our target to be predicted:

~~~
id | features                       | label
---|--------------------------------|---------
 1 | [1.7, 4.4, 7.6, 5.8, 9.6, 2.3] | 3.0
 2 | [8.8, 7.3, 5.7, 7.3, 2.2, 4.1] | 2.0
 3 | [1.2, 9.5, 2.5, 3.1, 8.7, 2.5] | 3.0
 4 | [3.7, 9.2, 6.1, 4.1, 7.5, 3.8] | 2.0
 5 | [8.9, 5.2, 7.8, 8.3, 5.2, 3.0] | 4.0
 6 | [7.9, 8.5, 9.2, 4.0, 9.4, 2.1] | 4.0
~~~
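
To make the criterion concrete, the F-value of each column can be computed directly from this table. A minimal pure-Python sketch of the one-way ANOVA F-statistic — an illustration of what `ANOVASelector` ranks features by, not the Spark implementation:

```python
# One-way ANOVA F-statistic per feature, on the toy data above:
# 3 label groups (2.0, 3.0, 4.0), 6 features, n = 6 rows.

features = [
    [1.7, 4.4, 7.6, 5.8, 9.6, 2.3],
    [8.8, 7.3, 5.7, 7.3, 2.2, 4.1],
    [1.2, 9.5, 2.5, 3.1, 8.7, 2.5],
    [3.7, 9.2, 6.1, 4.1, 7.5, 3.8],
    [8.9, 5.2, 7.8, 8.3, 5.2, 3.0],
    [7.9, 8.5, 9.2, 4.0, 9.4, 2.1],
]
labels = [3.0, 2.0, 3.0, 2.0, 4.0, 4.0]

def anova_f(values, labels):
    """One-way ANOVA F-statistic for a single feature column."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    # Between-group and within-group sums of squares.
    ssb = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())
    ssw = sum((v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g)
    return (ssb / (k - 1)) / (ssw / (n - k))

f_values = [anova_f([row[j] for row in features], labels) for j in range(6)]
best = max(range(6), key=lambda j: f_values[j])
print(best)  # 5 -> the last column has the highest F-value (~9.33)
```

The last column scores far higher than the rest because its values cluster tightly within each label group while the group means differ.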

If we use `ANOVASelector` with `numTopFeatures = 1`, the
last column in our `features` is chosen as the most useful feature.

## FValueSelector

`FValueSelector` operates on continuous labels with continuous features. It uses the
[F-test for regression](https://en.wikipedia.org/wiki/F-test#Regression_problems) to decide which
features to choose.
It supports five selection methods: `numTopFeatures`, `percentile`, `fpr`, `fdr`, `fwe`:
* `numTopFeatures` chooses a fixed number of top features according to the F-test for regression.
* `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number.
* `fpr` chooses all features whose p-values are below a threshold, thus controlling the false positive rate of selection.
* `fdr` uses the [Benjamini-Hochberg procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure) to choose all features whose false discovery rate is below a threshold.
* `fwe` chooses all features whose p-values are below a threshold. The threshold is scaled by 1/numFeatures, thus controlling the family-wise error rate of selection.

By default, the selection method is `numTopFeatures`, with the default number of top features set to 50.
The user can choose a selection method using `setSelectorType`.

**Examples**

Assume that we have a DataFrame with the columns `id`, `features`, and `label`, which is used as
our target to be predicted:

~~~
id | features                       | label
---|--------------------------------|---------
 1 | [6.0, 7.0, 0.0, 7.0, 6.0, 0.0] | 4.6
 2 | [0.0, 9.0, 6.0, 0.0, 5.0, 9.0] | 6.6
 3 | [0.0, 9.0, 3.0, 0.0, 5.0, 5.0] | 5.1
 4 | [0.0, 9.0, 8.0, 5.0, 6.0, 4.0] | 7.6
 5 | [8.0, 9.0, 6.0, 5.0, 4.0, 4.0] | 9.0
 6 | [8.0, 9.0, 6.0, 4.0, 0.0, 0.0] | 9.0
~~~
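
As with ANOVA, the regression F-value of each column can be computed by hand from this table. A minimal pure-Python sketch of the univariate F-test for regression — an illustration of what `FValueSelector` ranks features by, not the Spark implementation:

```python
# Univariate F-test for regression per feature, on the toy data above.
# For one feature x and label y, with Pearson correlation r and n rows:
#   F = r^2 * (n - 2) / (1 - r^2)

features = [
    [6.0, 7.0, 0.0, 7.0, 6.0, 0.0],
    [0.0, 9.0, 6.0, 0.0, 5.0, 9.0],
    [0.0, 9.0, 3.0, 0.0, 5.0, 5.0],
    [0.0, 9.0, 8.0, 5.0, 6.0, 4.0],
    [8.0, 9.0, 6.0, 5.0, 4.0, 4.0],
    [8.0, 9.0, 6.0, 4.0, 0.0, 0.0],
]
labels = [4.6, 6.6, 5.1, 7.6, 9.0, 9.0]

def regression_f(x, y):
    """F-statistic of the simple linear regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r2 = sxy * sxy / (sxx * syy)  # squared Pearson correlation
    return r2 * (n - 2) / (1 - r2)

f_values = [regression_f([row[j] for row in features], labels) for j in range(6)]
best = max(range(6), key=lambda j: f_values[j])
print(best)  # 2 -> the 3rd column has the highest F-value (~6.37)
```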

If we use `FValueSelector` with `numTopFeatures = 1`, the
3rd column in our `features` is chosen as the most useful feature.