Add probabilistic extrapolation model for classification accuracy #503

minghao-sun-sc · 2025-05-08T03:59:13Z

Pull Request: Accuracy Extrapolation Module for PyHealth

Contributor Information

Contributors: Minghao Sun
UIUC NetID: msun60
Paper title: A Probabilistic Method to Predict Classifier Accuracy on Larger Datasets given Small Pilot Data
Paper link: https://arxiv.org/abs/2311.18025

Contribution Type

Dataset Performance Extrapolation Module

Description

This pull request adds a new module to PyHealth that predicts how a model's performance (accuracy, AUROC, etc.) would change if it were trained on a larger dataset, using results from smaller pilot datasets. The implementation builds on the APEx-GP approach from "A Probabilistic Method to Predict Classifier Accuracy on Larger Datasets given Small Pilot Data" with two significant improvements:

- Matérn kernels: Model learning curves more realistically than standard RBF kernels, achieving lower MSE (up to a 13.1% improvement)
- Beta priors: Better handling of performance metrics bounded in [0, 1], such as AUROC
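
To make the two ideas above concrete, here is a minimal standard-library sketch of the closed-form Matérn and RBF covariances and of a Beta log-density. It is illustrative only and is not the code in pyhealth/metrics/extrapolation.py; the function names (`rbf_kernel`, `matern_kernel`, `beta_log_pdf`) and the example hyperparameters are assumptions chosen for this example.

```python
import math


def rbf_kernel(x1, x2, lengthscale=1.0):
    """Squared-exponential (RBF) covariance between two scalar inputs."""
    r = abs(x1 - x2) / lengthscale
    return math.exp(-0.5 * r * r)


def matern_kernel(x1, x2, lengthscale=1.0, nu=2.5):
    """Matern covariance for the closed-form cases nu in {0.5, 1.5, 2.5}."""
    r = abs(x1 - x2) / lengthscale
    if nu == 0.5:
        return math.exp(-r)
    if nu == 1.5:
        s = math.sqrt(3.0) * r
        return (1.0 + s) * math.exp(-s)
    if nu == 2.5:
        s = math.sqrt(5.0) * r
        return (1.0 + s + s * s / 3.0) * math.exp(-s)
    raise ValueError("nu must be 0.5, 1.5, or 2.5")


def beta_log_pdf(y, alpha, beta):
    """Log-density of Beta(alpha, beta); -inf outside the open interval (0, 1)."""
    if not 0.0 < y < 1.0:
        return float("-inf")
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
    return log_norm + (alpha - 1.0) * math.log(y) + (beta - 1.0) * math.log1p(-y)


if __name__ == "__main__":
    # Hypothetical inputs: log10 of two training-set sizes (100 and 100,000).
    a, b = math.log10(100), math.log10(100_000)
    print("RBF       :", rbf_kernel(a, b))
    print("Matern 5/2:", matern_kernel(a, b, nu=2.5))
    # A Beta density assigns zero probability to impossible metric values.
    print("log p(0.85):", beta_log_pdf(0.85, alpha=8.0, beta=2.0))
    print("log p(1.20):", beta_log_pdf(1.20, alpha=8.0, beta=2.0))
```

The Matérn family produces rougher sample paths than the infinitely smooth RBF kernel, and its covariance decays more slowly at large distances. The Beta density places no mass outside (0, 1), which is the property that makes it a natural fit for bounded metrics.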

The module is particularly valuable for healthcare ML applications where data collection is expensive and time-consuming, as it helps researchers make informed decisions about whether collecting more data is likely to significantly improve model performance.

Files Overview

Core Implementation:
- pyhealth/metrics/extrapolation.py: Main module implementing GP-based performance extrapolation
- pyhealth/metrics/__init__.py: Updated to include the new module exports
- pyhealth/utils.py: Added tensor_to_numpy helper function

Examples & Documentation:
- pyhealth/metrics/README_EXTRAPOLATION.md: Detailed module documentation
- PyHealth/examples/accuracy_extrapolation_example.py: Example usage script

Tests:
- pyhealth/unittests/test_extrapolation.py: Unit tests for the module

Dependencies:
- Added gpytorch and matplotlib to requirements.txt

Add probabilistic extrapolation model for classification accuracy

7f47a99
