[ENH] Precompute data to accelerate training in GPU #1850
base: main
Conversation
The primary issue here is that `__getitem__()` performs pre-processing, which is typically conducted before training actually starts. As a result, the GPU is frequently idle, resulting in slower training completion. It is well known that a GPU achieves higher throughput the busier it can be kept, but the pre-processing done in `__getitem__()` each time an item is retrieved severely limits this. This commit ensures that pre-processing is performed prior to training, as follows:
1. calling `to_dataloader()` will also call the `precompute()` function
2. the `precompute()` function collects pre-computed items from `__precompute__getitem__()` and stores them in a cache; this function relies on the sampler to retrieve the data index
3. `__precompute__getitem__()` is the unmodified algorithm of the original `__getitem__()`, to ensure an equivalent outcome
4. the new `__getitem__()` retrieves items from the cache in order, because relying on `idx` may result in a different index sequence due to the first sampler call from `precompute()`
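For readers skimming the thread, here is a minimal, illustrative sketch of the caching pattern described above. The method names follow the commit description, but the bodies are placeholders rather than the PR's actual implementation:

```python
from torch.utils.data import Dataset


class PrecomputeSketch(Dataset):
    """Illustrative only -- mirrors the control flow described above."""

    def precompute(self, sampler):
        # Walk the sampler once up front and run the original, expensive
        # pre-processing for every index, caching the resulting items.
        self._cache = [self.__precompute__getitem__(idx) for idx in sampler]
        self._cursor = 0

    def __precompute__getitem__(self, idx):
        # Unmodified algorithm of the original __getitem__, so cached items
        # are equivalent to what the slow path would have produced.
        raise NotImplementedError  # placeholder in this sketch

    def __getitem__(self, idx):
        # Serve items from the cache in order rather than by idx, because the
        # sampler was already consumed once during precompute().
        item = self._cache[self._cursor]
        self._cursor = (self._cursor + 1) % len(self._cache)
        return item

    def __len__(self):
        return len(getattr(self, "_cache", []))
```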
Nice!
FYI @phoeenniixx, @PranavBhatP, @xandie985 - is this something we need to take into account in v2? We should also think about adding profiling checks.
Code quality (linting) is failing; this can be fixed by using pre-commit or automated formatting. The pytorch-forecasting docs do not cover this yet (we should add it), but it is the same as in sktime:
https://www.sktime.net/en/stable/developer_guide/coding_standards.html
FYI @agobbifbk
Agree, most of the time you can fit your data in memory, and we should include the precomputation possibility in the d2 layer. We should already have the correct indexes computed; it is just a matter of creating the tensors according to those indexes.
When you say
This is not an issue anymore; I missed removing the fit function (used during my testing) that I had placed prior to the Tuner.
The fast path can be activated by enabling `precompute=True` in `to_dataloader`:
```python
.to_dataloader(..., precompute=True)
```
Hi @jobs-git, I see you are still facing some linting issues. May I suggest setting up pre-commit in your local repo? This will reduce your effort very much (it automatically solves some issues, not requiring you to make the changes yourself), and you do not need to wait here and make changes based on the errors shown in the code-quality CI workflows :)
Codecov Report
❌ Patch coverage is
Additional details and impacted files
```
@@            Coverage Diff            @@
##             main    #1850   +/-   ##
=======================================
  Coverage        ?   87.22%
=======================================
  Files           ?      158
  Lines           ?     9312
  Branches        ?        0
=======================================
  Hits            ?     8122
  Misses          ?     1190
  Partials        ?        0
```
Is this ready for review?
Yes, please. Inputs are welcome.
Thanks - while we wait for reviews, can you kindly add tests? For TimeSeriesDataSet in isolation, and in integration with the networks?
We need exact parity for any previously working code - we cannot push a breaking change with a release - so it would be great if we could also ensure this with a test. Namely, that
Got it, I will test this and add that as a sample, and probably some demonstrations on available real-world samples from sktime too.
This notebook demonstrates the benefit of precomputing tensors prior to model training on the GPU. The GPU is a throughput device, which means we get higher performance the more it is kept active. However, it is well known that stalling data transfer from CPU to GPU is very costly, and the just-in-time calculations the CPU performs between batches and item retrievals further starve the expected throughput. To overcome this, we can precompute all the tensors the model needs so that it can continuously perform forward and backward propagation with minimal delay on the GPU (a minimal timing sketch follows the outline below).
* Initial setup
* Testing on precompute = True
* Testing on default code
* Performance comparisons
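A minimal timing sketch along those lines; `training` is a hypothetical `TimeSeriesDataSet` built earlier in the notebook, and the batch size is arbitrary:

```python
import time


def time_one_pass(dataloader):
    # Time a full pass over the dataloader to isolate the data-feeding cost.
    start = time.perf_counter()
    for _ in dataloader:
        pass
    return time.perf_counter() - start


fast = training.to_dataloader(train=True, batch_size=128, precompute=True)
slow = training.to_dataloader(train=True, batch_size=128, precompute=False)
print(f"precompute=True : {time_one_pass(fast):.2f}s")
print(f"precompute=False: {time_one_pass(slow):.2f}s")
```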
@jobs-git, are you still working on this? Let us know if you need help!
Nothing anymore, it's all complete.
Quick request: can you kindly move the precompute_bench notebooks to another branch? They are quite large.
Sure, will separate the notebooks.
The commit to achieve a close-to-parity algorithm with the old batching lowered the gain. So far, I can achieve about a 900% gain on synthetic data. I will create a separate PR that demonstrates its usage and benchmarks performance.
Description
Fixes: #1849; Fixes: #1426; Fixes: #1860; Fixes: #922; Fixes: #715; Fixes: sktime/sktime#8278
Closes: #1846; Closes: #991
Supersedes: #806
The `__getitem__()` and `_collate_fn()` methods in `TimeSeriesDataset` are performance bottlenecks. They run computation-intensive branching logic on every call, which temporarily stalls compute devices and slows down training.

To improve on these, the following major changes were made (a hedged sketch of the resulting wiring follows the list):

- Added `__precompute__`, which performs the calculations otherwise conducted by `__getitem__` and `_collate_fn()` and stores the batch data in `self.precollate_cache`; `__precompute__` does not assume its own data order, but follows the `idx` provided by `BatchSamplers`
- Added `__fast_collate_fn__`, which retrieves batch data from `self.precollate_cache`
- Moved the logic of `__getitem__` to `__item_tensor__` so it can be used in pre-compute
- Modified `__getitem__` to accommodate both the old code path and the fast path when `precompute=True`
- When `precompute=True`, `__getitem__` returns `None`, as only the return value of `__fast_collate_fn__` is needed by the `DataLoader` to retrieve the batches used to train models
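A hedged sketch of how the fast path could be wired inside `to_dataloader()`, per the list above; the actual PR code may differ in detail:

```python
from torch.utils.data import DataLoader


def to_dataloader(dataset, batch_size=64, precompute=False, **kwargs):
    if precompute:
        # Fill dataset.precollate_cache up front; per the description above,
        # __precompute__ follows the idx order produced by the batch sampler.
        dataset.__precompute__()
        # __getitem__ now returns None; the batches themselves are produced
        # by __fast_collate_fn__, which reads them out of the cache.
        return DataLoader(
            dataset,
            batch_size=batch_size,
            collate_fn=dataset.__fast_collate_fn__,
            **kwargs,
        )
    # Original slow path: per-item pre-processing in __getitem__ plus _collate_fn.
    return DataLoader(
        dataset, batch_size=batch_size, collate_fn=dataset._collate_fn, **kwargs
    )
```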
Note: `TimeSeriesDataset` defaults to the slow path, so the old method is used.

Caveat and Limitations
This feature stores precomputed data in RAM. Therefore, sufficient memory must be available; otherwise, an out-of-memory (OOM) error may occur.
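As a rough way to gauge whether the cache fits in RAM, one can estimate its size from the window dimensions. This is a hypothetical back-of-the-envelope calculation, not taken from this PR; real footprints depend on dtypes and which fields the dataset carries:

```python
# Hypothetical numbers purely for illustration.
n_windows = 100_000       # number of (encoder, decoder) samples in the dataset
encoder_length = 60
decoder_length = 20
n_features = 12
bytes_per_value = 4       # float32

values_per_window = (encoder_length + decoder_length) * n_features
cache_bytes = n_windows * values_per_window * bytes_per_value
print(f"~{cache_bytes / 1024**3:.2f} GiB needed for the precomputed cache")
```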
Super batching could be considered if `precompute=True` is required but memory is insufficient. However, implementing that is beyond the scope of this PR.
Recommend to use `precompute=False` to use the original slow path.
For extremely large datasets, FSDP (Fully Sharded Data Parallel) may be used with PyTorch Lightning for distributed training. This, too, is outside the scope of this PR.
Recommend to use `precompute=False` to use the original slow path.

Checklist
- `__getitem__()`
- `collate_fn()`
- `precompute=True`
- `precompute=True` is consistent with default behaviour (`precompute=False`)
- Hooks installed with `pre-commit install`. To run hooks independent of commit, execute `pre-commit run --all-files`