Draw representative sample distribution from built data #100

@nsfabina

Description

The current data build rules take a limited number of samples from each training data file, with the limit set by the maximum number of samples and the maximum data size.

Imagine you have categorical responses with disproportionate representation (this is what I actually have). Weighting the responses according to their proportion, as currently implemented, works relatively well. However, the weights do not seem to be sufficient when the imbalance is extreme enough.
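
To make the weighting concrete, here is a minimal sketch of inverse-frequency class weights. It assumes the weights are inversely proportional to each class's share of the data, which may not match the current implementation exactly. The function name `inverse_frequency_weights` and the `water`/`reef` labels are made up for illustration.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return {class: weight}, with weight proportional to 1 / class frequency.

    Hypothetical helper for illustration, not the project's implementation.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    # A class holding exactly 1 / n_classes of the samples gets weight 1.0.
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Made-up, heavily imbalanced responses: 9990 of one class, 10 of the other.
labels = ["water"] * 9990 + ["reef"] * 10
weights = inverse_frequency_weights(labels)
print(weights)
```

With this split, the rare class gets a weight of 500 and the common class about 0.5, so the ratio between the two weights grows directly with the imbalance.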

One solution could be to draw representative samples from the built training data so that the response distribution is evened out.
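
A rough sketch of what that draw could look like, assuming the built data can be indexed by sample and that each sample has a single categorical response. The name `draw_balanced_indices`, the `n_per_class` parameter, and the choice to oversample rare classes with replacement are all assumptions for illustration, not an existing API.

```python
import random
from collections import Counter, defaultdict


def draw_balanced_indices(labels, n_per_class, seed=0):
    """Return sample indices with exactly n_per_class draws for every class.

    Hypothetical helper: classes with enough samples are drawn without
    replacement, rarer classes are oversampled with replacement.
    """
    rng = random.Random(seed)

    # Group sample indices by their categorical response.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    chosen = []
    for indices in by_class.values():
        if len(indices) >= n_per_class:
            chosen.extend(rng.sample(indices, n_per_class))
        else:
            chosen.extend(rng.choices(indices, k=n_per_class))

    # Shuffle so the classes are interleaved in the returned order.
    rng.shuffle(chosen)
    return chosen


# Made-up imbalanced responses: 990 of one class, 10 of the other.
labels = ["water"] * 990 + ["reef"] * 10
balanced = draw_balanced_indices(labels, n_per_class=50)
balanced_counts = Counter(labels[i] for i in balanced)
print(balanced_counts)
```

Oversampling with replacement repeats the rare samples, so it evens out the distribution without adding new information. Undersampling the common class to the size of the rarest one is the alternative if repeated samples are a concern.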
