Description
@gpauloski noted that there is an import overhead when using Python tasks. In the example from #3892 there is an apparent import overhead of 1 task/sec. This may be an issue in how we serialize functions for distribution, which I may not understand entirely.
Currently, we use cloudpickle to serialize Python functions and their arguments. There are some nuances between serialization modules that are worth discussing. If we serialize the function below, deserializing it causes the module-level imports it references to run at load time.
import matplotlib.pyplot as plt

def func(x):
    if x == 1:
        return x
    else:
        plt.plot(1,1)
We see this if we try to deserialize without the relevant module in our environment:
Traceback (most recent call last):
  File "/g/g16/slydelgado1/bslydelg_cctools/cctools/taskvine/src/bindings/load.py", line 10, in <module>
    x = cloudpickle.load(f)
        ^^^^^^^^^^^^^^^^^^^
  File "/g/g16/slydelgado1/miniconda3/envs/cenv/lib/python3.12/site-packages/cloudpickle/cloudpickle.py", line 457, in subimport
    __import__(name)
ModuleNotFoundError: No module named 'matplotlib'
If we comment out the relevant import, deserialization succeeds, and an error occurs only if execution takes the branch that uses the import:
#import matplotlib.pyplot as plt
def func(x):
    if x == 1:
        return x
    else:
        plt.plot(1,1)

x = cloudpickle.load(f)
x(0)
Traceback (most recent call last):
  File "/g/g16/slydelgado1/bslydelg_cctools/cctools/taskvine/src/bindings/load.py", line 11, in <module>
    x(0)
  File "/g/g16/slydelgado1/bslydelg_cctools/cctools/taskvine/src/bindings/test.py", line 15, in func
    plt.plot(1,1)
    ^^^
NameError: name 'plt' is not defined
From my perspective, this is the preferred failure case.
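One way to get a similar failure mode without commenting out the import is to defer it into the branch that needs it. This is only a minimal sketch, not something TaskVine does today. It assumes the serializer records only the module-level globals a function references, so a function-local import should not be replayed at load time.

```python
def func(x):
    if x == 1:
        return x
    # Deferred import: runs only when this branch executes, so a missing
    # matplotlib surfaces as ModuleNotFoundError here rather than at load time.
    import matplotlib.pyplot as plt
    plt.plot(1, 1)
```

With this layout, `func(1)` never touches matplotlib, while `func(0)` pays the import cost (or fails) on the worker at call time.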
The first example also increases the latency of deserializing the function:
109.98052644492145 deserializations/s
versus the second example:
4636.137402126293 deserializations/s
With dill, this is not an issue in either case if the function is defined in __main__:
4605.406487074855 deserializations/s
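The script that produced these rates is not shown, so the following is only a sketch of how a deserializations/s figure could be measured. The helper name `deserializations_per_second` is hypothetical, and the standard `pickle` module stands in for the serializer so the example runs without extra packages.

```python
import pickle
import time


def deserializations_per_second(loads, payload, repeats=1000):
    """Return how many times per second `loads(payload)` completes."""
    start = time.perf_counter()
    for _ in range(repeats):
        loads(payload)
    elapsed = time.perf_counter() - start
    return repeats / elapsed


# Stand-in payload built with the standard library; swap in
# cloudpickle.dumps / cloudpickle.loads or dill.dumps / dill.loads
# to compare serializers on a real function.
payload = pickle.dumps({"x": 1})
rate = deserializations_per_second(pickle.loads, payload)
print(f"{rate} deserializations/s")
```

Note that repeated loads inside one process hit the `sys.modules` cache after the first import, so a measurement of import overhead depends on whether each load runs in a fresh interpreter.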
However, dill behaves differently for functions defined outside the __main__ module, and in that case it behaves like cloudpickle in the first example. For example, if func is defined outside __main__:
  File "/g/g16/slydelgado1/bslydelg_cctools/cctools/taskvine/src/bindings/load.py", line 10, in <module>
    x = dill.load(f)
        ^^^^^^^^^^^^
  File "/g/g16/slydelgado1/miniconda3/envs/cenv/lib/python3.12/site-packages/dill/_dill.py", line 289, in load
    return Unpickler(file, ignore=ignore, **kwds).load()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/g/g16/slydelgado1/miniconda3/envs/cenv/lib/python3.12/site-packages/dill/_dill.py", line 444, in load
    obj = StockUnpickler.load(self)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/g/g16/slydelgado1/miniconda3/envs/cenv/lib/python3.12/site-packages/dill/_dill.py", line 434, in find_class
    return StockUnpickler.find_class(self, module, name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/g/g16/slydelgado1/bslydelg_cctools/cctools/taskvine/src/bindings/serz.py", line 2, in <module>
    import matplotlib.pyplot as plt
ModuleNotFoundError: No module named 'matplotlib'
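The traceback passes through `StockUnpickler.find_class`, which is the standard pickle path for objects stored by reference: only the module and attribute name are saved, and the module is imported when the object is loaded. The sketch below reproduces that mechanism with the standard `pickle` module and a hypothetical module named `serz_demo`. It does not use dill itself, and it assumes (as the traceback suggests) that dill takes the same path for functions defined outside `__main__`.

```python
import importlib
import os
import pickle
import sys
import tempfile

# Hypothetical module standing in for serz.py.
MODULE_SOURCE = "def func(x):\n    return x\n"

with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "serz_demo.py"), "w") as f:
        f.write(MODULE_SOURCE)
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        serz_demo = importlib.import_module("serz_demo")
        # Pickled by reference: the payload holds only the module and
        # function names, not the function's code.
        payload = pickle.dumps(serz_demo.func)

        # Simulate a worker that has not imported the module yet.
        del sys.modules["serz_demo"]
        restored = pickle.loads(payload)  # find_class imports serz_demo again
        reimported = "serz_demo" in sys.modules
    finally:
        sys.path.remove(tmp)

print(reimported, restored(3))
```

Because loading runs the module's top-level code, any import at the top of that module (such as matplotlib in serz.py) is executed, or fails, at deserialization time.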
Additionally, switching from cloudpickle to dill does not necessarily improve latency. From the example in #3892:
cloudpickle: 16.100728164861362 tasks/s
dill: 10.413069189953614 tasks/s