Add probabilistic extrapolation model for classification accuracy #503

Open: wants to merge 1 commit into base: master

minghao-sun-sc

Pull Request: Accuracy Extrapolation Module for PyHealth

Contributor Information

  • Contributors: Minghao Sun
  • UIUC NetID: msun60
  • Paper title: A Probabilistic Method to Predict Classifier Accuracy on Larger Datasets given Small Pilot Data
  • Paper link: https://arxiv.org/abs/2311.18025

Contribution Type

Dataset Performance Extrapolation Module

Description

This pull request adds a new module to PyHealth that predicts how a model's performance (accuracy, AUROC, etc.) would change if it were trained on a larger dataset, using results measured on a smaller pilot dataset. The implementation builds on the APEx-GP approach from "A Probabilistic Method to Predict Classifier Accuracy on Larger Datasets given Small Pilot Data" with two significant improvements:

  1. Matérn kernels: model learning curves more realistically than the standard RBF kernel, achieving lower MSE (up to 13.1% improvement)
  2. Beta priors: better handle bounded metrics (such as AUROC) that are constrained to [0, 1]
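
To illustrate the two ingredients named above, here is a minimal standard-library sketch. It is not the code in `pyhealth/metrics/extrapolation.py`; the function names (`matern52`, `rbf`, `beta_log_pdf`), the choice of smoothness 5/2, and the default hyperparameters are assumptions made for illustration only.

```python
import math


def matern52(x1, x2, lengthscale=1.0, variance=1.0):
    """Matérn kernel with smoothness nu = 5/2 for scalar inputs (illustrative)."""
    r = abs(x1 - x2) / lengthscale
    s = math.sqrt(5.0) * r
    # k(r) = variance * (1 + sqrt(5) r + 5 r^2 / 3) * exp(-sqrt(5) r)
    return variance * (1.0 + s + s * s / 3.0) * math.exp(-s)


def rbf(x1, x2, lengthscale=1.0, variance=1.0):
    """Squared-exponential (RBF) kernel for scalar inputs, for comparison."""
    r = abs(x1 - x2) / lengthscale
    return variance * math.exp(-0.5 * r * r)


def beta_log_pdf(p, a, b):
    """Log density of Beta(a, b) at p; -inf outside the open interval (0, 1)."""
    if not 0.0 < p < 1.0:
        return float("-inf")
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1.0) * math.log(p) + (b - 1.0) * math.log(1.0 - p)


if __name__ == "__main__":
    # Hypothetical dataset sizes, compared on a log scale.
    small, large = math.log(500.0), math.log(10000.0)
    print("Matern 5/2 covariance:", matern52(small, large))
    print("RBF covariance:       ", rbf(small, large))
    # A Beta prior puts no mass on metric values outside [0, 1].
    print("log p(0.85 | Beta(8, 2)):", beta_log_pdf(0.85, 8.0, 2.0))
    print("log p(1.20 | Beta(8, 2)):", beta_log_pdf(1.20, 8.0, 2.0))
```

At the same lengthscale, the Matérn 5/2 kernel decays more slowly than the RBF kernel at large distances and yields less smooth sample paths, which is one common motivation for preferring it on learning curves. The Beta log density returns negative infinity outside (0, 1), so a prior of this form cannot place mass on impossible metric values.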

The module is particularly valuable for healthcare ML applications where data collection is expensive and time-consuming, as it helps researchers make informed decisions about whether collecting more data is likely to significantly improve model performance.

Files Overview

  • Core Implementation:

    • pyhealth/metrics/extrapolation.py: Main module implementing GP-based performance extrapolation
    • pyhealth/metrics/__init__.py: Updated to include the new module exports
    • pyhealth/utils.py: Added tensor_to_numpy helper function
  • Examples & Documentation:

    • pyhealth/metrics/README_EXTRAPOLATION.md: Detailed module documentation
    • PyHealth/examples/accuracy_extrapolation_example.py: Example usage script
  • Tests:

    • pyhealth/unittests/test_extrapolation.py: Unit tests for the module
  • Dependencies:

    • Added gpytorch and matplotlib to requirements.txt
