
Feat: Multi-GPU Evaluation #3611

Open
wants to merge 5 commits into master from mattb.multi-gpu-evaluate
Conversation

MattGPT-ai
Contributor

Flair now supports multi-GPU training, but not evaluation. This means the work of n-1 GPUs is wasted during evaluation, which can dramatically reduce the benefit of multi-GPU training when the eval set is large. Even worse, I believe multi-GPU evaluation can actually be slower than single-GPU evaluation, because the CPU-bound portions of the evaluation code run n times while contending for the same CPU and memory resources.

This PR implements multi-GPU acceleration for evaluate in the Classifier, TextRegressor, and TextPairRegressor model types. A DistributedSampler splits the eval set across the GPUs, each process runs predictions on its shard, and the inference results are aggregated across processes before the metrics are calculated in the main process and returned.
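To illustrate the first step, here is a pure-Python sketch of how a `DistributedSampler` partitions an eval set across ranks. `shard_indices` is a hypothetical helper (not code from this PR) that mirrors the sampler's round-robin split with shuffling off and default padding:

```python
def shard_indices(dataset_len, num_replicas, rank):
    """Sketch of DistributedSampler's partitioning (shuffle off).

    The sampler pads the index list so every rank receives the same
    number of samples, then each rank takes every num_replicas-th index.
    """
    indices = list(range(dataset_len))
    per_rank = -(-dataset_len // num_replicas)  # ceil division
    total = per_rank * num_replicas
    # pad by repeating indices from the front, as the real sampler does
    indices += indices[: total - dataset_len]
    return indices[rank:total:num_replicas]

# Two ranks over 10 samples: disjoint halves that cover the whole set.
print(shard_indices(10, 2, 0))  # [0, 2, 4, 6, 8]
print(shard_indices(10, 2, 1))  # [1, 3, 5, 7, 9]
```

Note the padding caveat: when the eval set size is not divisible by the number of replicas, a sample is duplicated (e.g. 5 samples over 2 ranks yields `[0, 2, 4]` and `[1, 3, 0]`), which an evaluation implementation has to account for when aggregating metrics.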

@MattGPT-ai MattGPT-ai force-pushed the mattb.multi-gpu-evaluate branch from da4e8f0 to 7734bdd Compare February 5, 2025 06:58
@MattGPT-ai
Contributor Author

Looks like checks are hitting an unrelated type error:
4256: error: Argument 1 to "Entity" has incompatible type "tuple[Optional[int], int]"; expected "tuple[int, int]" [arg-type]

@alanakbik
Collaborator

@MattGPT-ai this is due to a new mypy version and affected a deprecated class. I just fixed it in #3613. If you update this branch to current master the error should disappear.

…gather functions to distributed utils.

This works by using a DistributedSampler to allocate samples across GPU processes, then aggregating the predictions and losses from all processes before running the evaluation. Broadcast ensures all processes return the same valid result.
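The aggregate-then-score step might look roughly like the sketch below. `aggregate_and_score` and the toy (gold, prediction) pairs are hypothetical; real code would use torch.distributed collectives such as `all_gather_object` to collect the per-rank lists, and a broadcast so every rank returns the metrics computed on the main process:

```python
def aggregate_and_score(per_rank_preds, per_rank_losses):
    """Sketch: combine per-rank inference results into one metric dict.

    per_rank_preds: one list of (gold, prediction) pairs per process,
    as they would look after gathering across ranks.
    per_rank_losses: one mean loss value per process.
    """
    # flatten the gathered shards back into a single prediction list
    preds = [pair for shard in per_rank_preds for pair in shard]
    # average the per-rank losses
    loss = sum(per_rank_losses) / len(per_rank_losses)
    # compute a simple accuracy over all samples
    accuracy = sum(1 for gold, guess in preds if gold == guess) / len(preds)
    # in the real flow, this result is computed on the main process and
    # broadcast so every rank returns the same dict
    return {"loss": loss, "accuracy": accuracy}

result = aggregate_and_score(
    [[("a", "a"), ("b", "b")], [("a", "b"), ("c", "c")]],
    [0.2, 0.4],
)
print(result["accuracy"])  # 0.75
```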
@MattGPT-ai MattGPT-ai force-pushed the mattb.multi-gpu-evaluate branch from 7734bdd to c0f7e7d Compare February 5, 2025 22:33
@MattGPT-ai
Contributor Author

Awesome that worked, checks passed!
