TransProPyBook/UtilsFunction3.qmd at main · SSSYDYSSS/TransProPyBook · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
# UtilsFunction3

This section serves as the helper function for the AutoFeatureSelection.py function.

## LoadFilterTranspose.py
Remove samples with high zero expression.

### Parameters
>- <b style="color: #43799a;">data_path: string</b>:
>   - <b style="color: #54467d;">For example</b>: '../data/gene_tpm.csv'
>   - <b style="color: #54467d;">Please note</b>: The input data matrix should have genes as rows and samples as columns.
>- <b style="color: #43799a;">threshold: float</b>:
>   - <b style="color: #54467d;">For example</b>: 0.9
>   - The set threshold indicates the proportion of non-zero value samples to all samples in each feature.

### Returns
>- <b style="color: #43799a;">X</b> (pandas.core.frame.DataFrame):

### Usage
```python
load_filter_transpose(
    threshold=0.9,
    data_path='../data/gene_tpm.csv'
    )
```

## LoadEncodeLabels.py
Reads a CSV file containing labels and encodes categorical labels in the specified column to numeric labels.

### Parameters
>- <b style="color: #43799a;">file_path (str)</b>:
>   - Path to the CSV file containing labels.
>- <b style="color: #43799a;">column_name (str)</b>:
>   - Name of the column to be encoded.

### Returns
>- <b style="color: #43799a;">Y</b> (pd.DataFrame):
>   - A DataFrame containing the encoded numeric labels.

### Usage
```python
load_encode_labels(
    file_path='../data/class.csv',
    column_name='class'
    )
```

## ExtractCommonSamples.py
Extracts common samples (rows) from two DataFrames based on their indices.

### Parameters
>- <b style="color: #43799a;">X (pd.DataFrame)</b>:
>   - First DataFrame.
>- <b style="color: #43799a;">Y (pd.DataFrame)</b>:
>   - Second DataFrame.

### Returns
>- <b style="color: #43799a;">X_common, Y_common</b> (pd.DataFrame):
>   - Two DataFrames containing only the rows that are common in both.

### Usage
```python
extract_common_samples(
    X,
    Y
    )
```

## LoadAndPreprocessData.py
Load and preprocess the data.

### Parameters
>- <b style="color: #43799a;">feature_file: str</b>:
>   - Path to the feature data file.
>- <b style="color: #43799a;">label_file: str</b>:
>   - Path to the label data file.
>- <b style="color: #43799a;">label_column: str</b>:
>   - Column name of the labels in the label file.
>- <b style="color: #43799a;">threshold: float</b>:
>   - Threshold for filtering in load_filter_transpose function.

### Returns
>- <b style="color: #43799a;">X</b> (DataFrame):
>   - Preprocessed feature data.
>- <b style="color: #43799a;">Y</b> (ndarray):
>   - Preprocessed label data.

### Usage
```python
load_and_preprocess_data(
    feature_file,
    label_file,
    label_column,
    threshold
    )
```

## SetupLoggingAndProgressBar.py
Set up logging and initialize a tqdm progress bar.

### Parameters
>- <b style="color: #43799a;">n_iter (int)</b>:
>   - Number of iterations for RandomizedSearchCV.
>- <b style="color: #43799a;">n_cv (int)</b>:
>   - Number of cross-validation folds.

### Returns
>- <b style="color: #43799a;">tqdm object</b>
>   - An initialized tqdm progress bar.

### Usage
```python
setup_logging_and_progress_bar(
    n_iter,
    n_cv
    )
```

## UpdateProgressBar.py
Read the number of log entries in the log file and update the tqdm progress bar.

### Parameters
>- <b style="color: #43799a;">pbar (tqdm)</b>:
>   - The tqdm progress bar object.
>- <b style="color: #43799a;">log_file (str)</b>:
>   - Path to the log file, default is 'progress.log'.

### Usage
```python
update_progress_bar(
    pbar,
    log_file='progress.log'
    )
```

## LoggingCustomScorer.py
Creates a custom scorer function for use in model evaluation processes.
This scorer logs both the accuracy score and the time taken for each call.

### Parameters
>- <b style="color: #43799a;">n_iter (int)</b>:
>   - Number of iterations for the search process. Default is 10.
>- <b style="color: #43799a;">n_cv (int)</b>:
>   - Number of cross-validation splits. Default is 5.

### Returns
>- <b style="color: #43799a;">custom_scorer(function)</b>
>   - A custom scorer function that logs the accuracy score and time taken for each call.

### Usage
```python
logging_custom_scorer(
    n_iter=10,
    n_cv=5
    )
```

## TqdmCustomScorer.py
> Creates a custom scorer for model evaluation, integrating a progress bar with `tqdm`.

### Parameters
>- <b style="color: #43799a;">n_iter: int</b> (optional):
>   - Number of iterations for the search process. Default is 10.
>- <b style="color: #43799a;">n_cv: int</b> (optional):
>   - Number of cross-validation splits. Default is 5.

### Returns
>- <b style="color: #43799a;">function</b>:
>   - A custom scorer function that can be used with model evaluation methods like `RandomizedSearchCV`.

### Description
> The `tqdm_custom_scorer` function creates a scorer for model evaluation, incorporating a `tqdm` progress bar to monitor the evaluation process. This scorer is especially useful in processes like `RandomizedSearchCV`, where it provides real-time feedback on the number of iterations and cross-validation steps completed.

### Usage
```python
custom_scorer = tqdm_custom_scorer(n_iter=10, n_cv=5)
# Use this scorer in RandomizedSearchCV or similar methods
```


## TrainModel.py
Set up and run the model training process.

### Parameters
>- <b style="color: #43799a;">X: DataFrame</b>:
>   - feature data.
>- <b style="color: #43799a;">Y: ndarray</b>:
>   - label data.
>- <b style="color: #43799a;">feature_selection</b>:
>   - FeatureUnion, the feature selection process.
>- <b style="color: #43799a;">parameters: dict</b>:
>   - parameters for RandomizedSearchCV.
>- <b style="color: #43799a;">n_iter: int</b>:
>   - number of iterations for RandomizedSearchCV.
>- <b style="color: #43799a;">n_cv: int</b>:
>   - number of cross-validation folds.
>- <b style="color: #43799a;">n_jobs: int</b>:
>   - number of jobs to run in parallel (default is 9).

### Returns
>- <b style="color: #43799a;">clf</b>
>   - RandomizedSearchCV object after fitting.

### Usage
```python
train_model(
    X,
    Y,
    feature_selection,
    parameters,
    n_iter,
    n_cv,
    n_jobs=9
    )
```

## EnsembleForRFE.py
Set up and run the Ensemble model for Recursive Feature Elimination.

### Parameters
>- <b style="color: #43799a;">svm_C: float</b>:
>  - Regularization parameter for SVM.
>- <b style="color: #43799a;">tree_max_depth: int</b>:
>  - Maximum depth of the decision tree.
>- <b style="color: #43799a;">tree_min_samples_split: int</b>:
>  - Minimum number of samples required to split an internal node.
>- <b style="color: #43799a;">gbm_learning_rate: float</b>:
>  - Learning rate for gradient boosting.
>- <b style="color: #43799a;">gbm_n_estimators: int</b>:
>  - Number of boosting stages for gradient boosting.

### Attributes
>- <b style="color: #43799a;">feature_importances_</b>:
>  - Array of feature importances after fitting the model.

### Methods
>- <b style="color: #43799a;">fit(X, y)</b>:
>  - Fit the model to data matrix X and target(s) y.
>- <b style="color: #43799a;">predict(X)</b>:
>  - Predict class labels for samples in X.
>- <b style="color: #43799a;">set_params(**params)</b>:
>  - Set parameters for the ensemble estimator.


## SetupFeatureSelection.py
> Set up the feature selection process in `TransProPy.UtilsFunction3`.
> This function is particularly useful for setting up a feature selection pipeline, especially in models that benefit from ensemble methods and mutual information-based feature selection.

### Returns
> - <b style="color: #43799a;">feature_selection: FeatureUnion</b>:
>   - A combined feature selection process.

### Description
> The `setup_feature_selection` function initializes and returns a `FeatureUnion` object for feature selection. This union includes:
> - `RFECV`: Utilizes an `EnsembleForRFE` estimator with `StratifiedKFold(5)` for cross-validation, focusing on accuracy.
> - `SelectKBest`: Applies `mutual_info_classif` for feature scoring.
> The combination of these techniques provides a robust approach to feature selection in machine learning models.


### Usage
> ```python
> feature_selection = setup_feature_selection()
> ```

## PrintBoxedText.py
> Prints a title in a boxed format in the console output.

### Parameters
> - <b style="color: #43799a;">title: str</b>:
>   - The text to be displayed inside the box.

### Returns
> - None. This function directly prints the formatted title to the console.

### Description
> This function creates a box around the given title text using hash (#) and equals (=) symbols. It prints the title with a border on the top and bottom, making it stand out in the console output. The border line consists of a hash symbol, followed by equals symbols the length of the title plus two (for padding), and then another hash symbol.

### Usage Example
> ```python
> print_boxed_text("Example Title")
> ```


## ExtractAndSaveResults.py
> The function uses matplotlib for plotting, pandas for data handling, and a custom `print_boxed_text` function for formatted output.

### Parameters
> - <b style="color: #43799a;">clf: trained model (RandomizedSearchCV object)</b>:
>   - The classifier object after training.
> - <b style="color: #43799a;">X: DataFrame</b>:
>   - Feature data used for training.
> - <b style="color: #43799a;">save_path: str</b>:
>   - Base path for saving results.
> - <b style="color: #43799a;">show_plot: bool</b> (optional):
>   - Whether to display the plot. Default is False.
> - <b style="color: #43799a;">use_tkagg: bool</b> (optional):
>   - Whether to use 'TkAgg' backend for matplotlib. Generally, choose False when using in PyCharm IDE, and choose True when rendering file.qmd to an HTML file.

### Description
> This function performs a comprehensive analysis and extraction of results from a trained model. It includes:
> - Extracting and plotting cross-validation results.
> - Identifying and printing features selected by RFECV and SelectKBest.
> - Combining and saving selected features in a CSV file.
> - Extracting and saving feature importances from EnsembleForRFE.
> - Extracting and saving scores from SelectKBest.

### Usage
> ```python
> extract_and_save_results(clf, X, "path/to/save/", show_plot=True)
> ```