Overhead from serialization contributing to task execution latency #3901

Open
@BarrySlyDelgado

@gpauloski noted that Python tasks incur an import overhead. In the example from #3892 there is an apparent import overhead of 1 task/sec. This may be an issue with how we serialize functions for distribution, which I may not understand entirely.

Currently, we use cloudpickle to serialize Python functions and their arguments. There are some nuances among the different serialization modules that are worth discussing. If we serialize the function below, the modules it references are imported when the function is deserialized:

import matplotlib.pyplot as plt

def func(x):
    if x == 1:
        return x
    else:
        plt.plot(1,1)

We see this if we try to deserialize without the relevant module in our environment:

Traceback (most recent call last):
  File "/g/g16/slydelgado1/bslydelg_cctools/cctools/taskvine/src/bindings/load.py", line 10, in <module>
    x = cloudpickle.load(f)
        ^^^^^^^^^^^^^^^^^^^
  File "/g/g16/slydelgado1/miniconda3/envs/cenv/lib/python3.12/site-packages/cloudpickle/cloudpickle.py", line 457, in subimport
    __import__(name)
ModuleNotFoundError: No module named 'matplotlib'
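
As a rough illustration of why the import happens on load, the sketch below lists the module objects that a function's code refers to through its globals. A by-value serializer has to record such modules in some form, and the traceback above shows cloudpickle re-importing one through `subimport`. The helper `referenced_modules` is hypothetical and is not cloudpickle's implementation; `json.decoder` stands in for `matplotlib.pyplot` so the snippet runs with the standard library only.

```python
import types

import json.decoder as dec  # stand-in for `import matplotlib.pyplot as plt`


def referenced_modules(func):
    """Map each global name used by func's code to the module it is bound to.

    Only the top-level code object is inspected; nested functions are
    ignored in this sketch.
    """
    found = {}
    for name in func.__code__.co_names:
        value = func.__globals__.get(name)
        if isinstance(value, types.ModuleType):
            found[name] = value.__name__
    return found


def func(x):
    if x == 1:
        return x
    else:
        return dec.JSONDecoder()


modules = referenced_modules(func)
print(modules)
```

Even though `func(1)` never touches `dec`, the module still shows up as a dependency of the function, which matches the behavior described above: the import is tied to the function's globals, not to the branch taken.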

If we comment out the relevant import, deserialization is no longer an issue; the failure occurs only if execution takes the branch that uses the import:

#import matplotlib.pyplot as plt

def func(x):
    if x == 1:
        return x
    else:
        plt.plot(1,1)

Loading this function and calling it on the branch that uses plt:

x = cloudpickle.load(f)
x(0)
Traceback (most recent call last):
  File "/g/g16/slydelgado1/bslydelg_cctools/cctools/taskvine/src/bindings/load.py", line 11, in <module>
    x(0)
  File "/g/g16/slydelgado1/bslydelg_cctools/cctools/taskvine/src/bindings/test.py", line 15, in func
    plt.plot(1,1)
    ^^^
NameError: name 'plt' is not defined

From my perspective, this is the preferred failure case.
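
The deferred failure can be reproduced without any serializer, since Python resolves global names only when the line that uses them runs. This is a minimal sketch under that assumption: `exec` into an empty namespace stands in for deserializing the function in an environment where `plt` was never imported, and `load_without_import` is a hypothetical helper name.

```python
SOURCE = '''
def func(x):
    if x == 1:
        return x
    else:
        plt.plot(1, 1)
'''


def load_without_import():
    """Define func in a namespace that has no `plt` binding."""
    namespace = {}
    exec(SOURCE, namespace)
    return namespace["func"]


func = load_without_import()

# The branch that does not use plt works normally.
ok = func(1)

# The branch that uses plt fails only when it is reached.
try:
    func(0)
    error = None
except NameError as exc:
    error = str(exc)

print(ok, error)
```

Defining the function succeeds and `func(1)` returns normally; only `func(0)` raises the `NameError`, which is the failure mode preferred above.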

The first example above also causes an increased latency to deserialize the function:

109.98052644492145 deserializations/s

Versus the second example:

4636.137402126293 deserializations/s

With dill, this is not an issue in either case if the function is defined in __main__:

4605.406487074855 deserializations/s
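
The script that produced these rates is not shown here, so the following is only a sketch of how a deserializations/s figure could be measured. The function name `deserializations_per_second` and the `repeats` parameter are assumptions, and the standard `pickle` module stands in for cloudpickle or dill; passing `cloudpickle.loads` or `dill.loads` with a matching payload should work the same way.

```python
import pickle
import time


def deserializations_per_second(loads, payload, repeats=1000):
    """Call loads(payload) `repeats` times and return the observed rate."""
    start = time.perf_counter()
    for _ in range(repeats):
        loads(payload)
    elapsed = time.perf_counter() - start
    return repeats / elapsed


# A builtin function is pickled by reference, which keeps the demo stdlib-only.
payload = pickle.dumps(len)
rate = deserializations_per_second(pickle.loads, payload)
print(f"{rate} deserializations/s")
```

Note that a loop like this runs in a single process, so any module imported by the first load is already in `sys.modules` for the later ones; a per-task measurement in fresh worker processes may look different.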

However, dill behaves differently for functions defined outside the __main__ module, and in that case resembles cloudpickle. For example, if func is defined outside __main__, loading it fails on the module's own import:

  File "/g/g16/slydelgado1/bslydelg_cctools/cctools/taskvine/src/bindings/load.py", line 10, in <module>
    x = dill.load(f)
        ^^^^^^^^^^^^
  File "/g/g16/slydelgado1/miniconda3/envs/cenv/lib/python3.12/site-packages/dill/_dill.py", line 289, in load
    return Unpickler(file, ignore=ignore, **kwds).load()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/g/g16/slydelgado1/miniconda3/envs/cenv/lib/python3.12/site-packages/dill/_dill.py", line 444, in load
    obj = StockUnpickler.load(self)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/g/g16/slydelgado1/miniconda3/envs/cenv/lib/python3.12/site-packages/dill/_dill.py", line 434, in find_class
    return StockUnpickler.find_class(self, module, name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/g/g16/slydelgado1/bslydelg_cctools/cctools/taskvine/src/bindings/serz.py", line 2, in <module>
    import matplotlib.pyplot as plt
ModuleNotFoundError: No module named 'matplotlib'
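
The `find_class` frames in this traceback point at pickling by reference: the payload stores the module and function names, and loading imports the module, which runs its top-level imports. The sketch below shows the same mechanism with the standard `pickle` module only. The module name `serz_demo` and the helper `roundtrip_reimports_module` are hypothetical, and `json` stands in for a heavy import such as matplotlib.

```python
import importlib
import os
import pickle
import sys
import tempfile

MODULE_NAME = "serz_demo"  # hypothetical module, written to a temp directory

MODULE_SOURCE = (
    "import json  # stands in for a heavy import such as matplotlib\n"
    "\n"
    "def func(x):\n"
    "    return json.dumps(x)\n"
)


def roundtrip_reimports_module():
    """Pickle a module-level function, forget the module, and load it again."""
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, MODULE_NAME + ".py"), "w") as f:
            f.write(MODULE_SOURCE)
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            module = importlib.import_module(MODULE_NAME)

            # By reference: the payload holds names, not the function's code.
            payload = pickle.dumps(module.func)

            # Simulate a fresh process that has not imported the module yet.
            del sys.modules[MODULE_NAME]
            loaded_before = MODULE_NAME in sys.modules

            # Loading looks the function up by name, which imports the module.
            func = pickle.loads(payload)
            loaded_after = MODULE_NAME in sys.modules

            return payload, loaded_before, loaded_after, func(1)
        finally:
            sys.path.remove(tmp)
            sys.modules.pop(MODULE_NAME, None)


payload, loaded_before, loaded_after, result = roundtrip_reimports_module()
print(loaded_before, loaded_after, result)
```

If the module's top-level import were unavailable on the loading side, the import triggered during loading would fail in the same way as the traceback above.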

Additionally, switching from cloudpickle to dill does not necessarily improve latency. From the example in #3892:

cloudpickle:

16.100728164861362 tasks/s

dill:

10.413069189953614 tasks/s
