Replies: 1 comment
- Consider using Kaggle and use a GPU for your training. On CPU you are limited by speed.
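In case it helps, a minimal device-check sketch so the same training code uses a GPU when one is available (e.g. on Kaggle or a Colab GPU runtime) and falls back to CPU otherwise:

```python
import torch

# Use a GPU if the runtime has one (Kaggle / Colab GPU), otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Then move the model and each batch to that device, e.g. model.to(device)
```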
-
Hi, I have been trying to do the initial training in section 9.3, but my Colab environment crashes.
To make sure it was not my code, I have now tried with the 08 notebook from GitHub: one epoch takes 8 minutes and then it crashes with
`Your session crashed. Automatically restarting.`
I can see that nearly all the memory is being used. Colab won't allocate me a GPU (I have the Pro version), so I am stuck with the CPU.
I have tried setting the batch size for the training data to 8 instead of 32. While this reduces the memory usage, the estimated time is 1 hr 22 minutes:
`1/10 [09:06<1:22:00, 546.78s/it]`
But this really cuts into the time allocated to do this module.
Any suggestions on how to speed this up, or has anyone else had this type of issue?
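For reference, the batch-size change was roughly the following; a sketch assuming a standard `torch.utils.data.DataLoader` setup, with `FakeData` standing in for the notebook's real image datasets (sizes and class counts are just illustrative):

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

BATCH_SIZE = 8  # reduced from 32: smaller batches use less RAM per step, but need more steps per epoch

# FakeData stands in for the real ImageFolder datasets built earlier in the notebook
train_data = datasets.FakeData(size=225, image_size=(3, 224, 224), num_classes=3,
                               transform=transforms.ToTensor())
test_data = datasets.FakeData(size=75, image_size=(3, 224, 224), num_classes=3,
                              transform=transforms.ToTensor())

train_dataloader = DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True)
test_dataloader = DataLoader(test_data, batch_size=BATCH_SIZE, shuffle=False)
```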
As a side note: running it on JupyterLab locally
I then set up a local Docker instance of JupyterLab on an Ubuntu workstation. This also had issues with the dataloaders and allocating shared memory. To get around this I first set `shm_size: '8gb'` in the Docker config for it, but there were still issues with the workers (maybe it is the type of CPU it has; I don't think all the cores are the same on the CPU I had), so I edited `data_setup.py` from `NUM_WORKERS = os.cpu_count()` to `NUM_WORKERS = 0`.
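As a rough sketch of what that edit looks like in context (assuming a `data_setup.py` along the lines of the course's `going_modular` script; the real file may differ slightly):

```python
# data_setup.py (relevant part, with the workaround applied)
import os

from torch.utils.data import DataLoader
from torchvision import datasets

# NUM_WORKERS = os.cpu_count()  # original: one worker process per CPU core
NUM_WORKERS = 0                 # workaround: load data in the main process instead

def create_dataloaders(train_dir: str,
                       test_dir: str,
                       transform,
                       batch_size: int,
                       num_workers: int = NUM_WORKERS):
    """Creates train/test DataLoaders from image folders."""
    train_data = datasets.ImageFolder(train_dir, transform=transform)
    test_data = datasets.ImageFolder(test_dir, transform=transform)
    class_names = train_data.classes

    train_dataloader = DataLoader(train_data, batch_size=batch_size,
                                  shuffle=True, num_workers=num_workers,
                                  pin_memory=True)
    test_dataloader = DataLoader(test_data, batch_size=batch_size,
                                 shuffle=False, num_workers=num_workers,
                                 pin_memory=True)
    return train_dataloader, test_dataloader, class_names
```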
This is probably not the best solution for speed, but it got me working on my own setup. The time is still not the best compared to the video, but at least it's only 13 minutes for all 10 epochs with a batch size of 32. Here are the results I got:
For anyone's reference, this was the error I was seeing relating to the number of workers when trying to visualise the dataloaders, after I had already increased the `shm_size`:

Just as an update, I realised that on my local setup `data_setup.create_dataloaders` already had `num_workers` as a parameter, so you can just make the change below in the notebook instead of editing `data_setup.py`, i.e. change the dataloader part to have this set to 0.
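A sketch of that notebook-side change (assuming the `create_dataloaders` signature above; `train_dir`, `test_dir` and `manual_transforms` are whatever the notebook already defines):

```python
# data_setup is imported earlier in the notebook (exact import path depends on your setup)
train_dataloader, test_dataloader, class_names = data_setup.create_dataloaders(
    train_dir=train_dir,
    test_dir=test_dir,
    transform=manual_transforms,
    batch_size=32,
    num_workers=0,  # override the os.cpu_count() default to avoid the shared-memory/worker errors
)
```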