stac-task is really slow to start up #179

@jkeifer

tl;dr: we need to consider how to make imports of stac-task components lazier, so we don't have to execute so much code on start up, especially for things we will never need to use.

Doing so is likely a breaking change, and requires users to import from a more direct path, i.e., not from stactask but from a subpackage or module path within stac-task.


We recently factored out the payload modeling and validation from the Task class into its own Payload class. This is a big win because it allows us to use that Payload model as the source of truth for a stac-task payload, and import it in other projects that need that model (such as cirrus).

The problem we found doing this in cirrus, however, is that we now have several issues because of stac-task dependencies and its init. Bear with me as I explain all this...

It's become a common pattern in python to expose models/functions as your "public API" by importing them in a package's top-level __init__.py and listing them out in __all__. This allows users to do something like from stactask import Task instead of having to do the more-verbose and less-obvious from stactask.task import Task. On the surface this is a good idea, because then users can simply look at what is exposed by your package (import stactask; dir(stactask)) to know what they likely need to know about and may want to use.

The problem with this becomes one of non-lazy init. Pythonic conventions state that we should always put our imports at the top of a module, in the module scope, except in the case of circular dependencies (which is probably a sign you need to refactor, so you can get back to imports at the top). But this convention, along with putting stuff in __init__.py modules, is a big reason why python is notorious for such slow startup times. I can use the case of cirrus importing the Payload class as a good example of this.

In cirrus we run import stactask.payload so we can use stactask.payload.Payload. The thing is, for python's import machinery to import stactask.payload, it must also execute stactask.__init__. Even if payload.py only imports warnings and from typing import Any, python must initialize the parent package in case the module requires anything from the package, or the package init has any side effects that might impact the imported module's behavior.
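To make that concrete, here is a minimal sketch that builds a throwaway package on disk and imports only one submodule from it. The names demo_pkg, heavy, and payload are hypothetical stand-ins, not the real stac-task layout.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk ("demo_pkg" is a made-up name).
root = Path(tempfile.mkdtemp())
pkg = root / "demo_pkg"
pkg.mkdir()
# The package init eagerly imports a sibling, like stactask.__init__ does.
(pkg / "__init__.py").write_text("from . import heavy\n")
(pkg / "heavy.py").write_text("LOADED = True\n")
# The module we actually want has no imports of its own.
(pkg / "payload.py").write_text("class Payload:\n    pass\n")

sys.path.insert(0, str(root))
importlib.import_module("demo_pkg.payload")

# Importing only the submodule still ran the package init,
# which in turn pulled in the unrelated heavy module.
print("demo_pkg" in sys.modules)        # True
print("demo_pkg.heavy" in sys.modules)  # True
```

Swap heavy for something with real dependencies and the submodule import pays for all of them.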

So let's look at what is in stactask.__init__:

from .config import DownloadConfig
from .payload import Payload
from .task import Task

try:
    from .__version__ import __version__, __version_tuple__
except ImportError:
    __version__ = "0.0.0"
    __version_tuple__ = ("0", "0", "0")

__all__ = [
    "__version__",
    "__version_tuple__",
    "Task",
    "Payload",
    "DownloadConfig",
]

Not a lot. But actually, quite a lot.

Let's just constrain ourselves to looking at DownloadConfig in stactask.config. When we run from .config import DownloadConfig, we execute stactask.config. This is also a small module with big implications:

from dataclasses import dataclass

from stac_asset import Config


@dataclass
class DownloadConfig(Config):
    pass

Doesn't look like much, but a critical line is from stac_asset import Config. Immediately we should see some warning lights flashing: importing a class from a package means either that class is defined in the package __init__.py, or more likely, the package is doing that thing again of importing all the things it wants to expose so they can be imported from the top level. But this means, as we discussed above, that many things are potentially happening when we import a silly little data class config model from stac_asset.

So what is stac_asset doing in its __init__.py? Shall we take a look?

Here's a snippet of just the imports from stac_asset.__init__:

from ._functions import (
    assert_asset_exists,
    asset_exists,
    download_asset,
    download_collection,
    download_file,
    download_item,
    download_item_collection,
    open_href,
    read_href,
)
from .client import Client, get_client_classes
from .config import Config
from .earthdata_client import EarthdataClient
from .errors import (
    AssetOverwriteError,
    ConfigError,
    ContentTypeError,
    DownloadError,
    DownloadWarning,
)
from .filesystem_client import FilesystemClient
from .http_client import HttpClient
from .messages import Message
from .planetary_computer_client import PlanetaryComputerClient
from .s3_client import S3Client
from .strategy import ErrorStrategy, FileNameStrategy

Phew, that's a lot of stuff. Immediately we can see some things that could be suspect: if you just want to use S3, then EarthdataClient and PlanetaryComputerClient feel like bloat. Maybe they're not, but they certainly seem to be.

I'll cut to the chase: why am I going on about this? Well, when we tried to run the lambda zip for cirrus after adding the dependency on the pure-python Payload class from stac-task, the lambdas all errored out. Why? pydantic-core could not be found!

Yes, you read that right. pydantic-core. What? Where is that coming from? And why, if we are managing deps correctly, could it not be found?

Following this chain down: cirrus now depends on stac-task, and stac-task on stac-asset. stac-asset has a dependency called aiohttp-oauth2-client to support OAuth2 where required. This lib uses pydantic for its request and response models, and thus brings a transitive dependency on pydantic-core. pydantic-core is a compiled dependency of pydantic as of v2 (the core of v2 is built in rust). Because it is compiled, we now have to ensure the lambda zip contains a bundle of our deps built for the specific python runtime version and for the correct lambda system architecture.

In our cirrus case, builds on an x86_64 github runner were then incompatible with an arm64 lambda deployment! All because stac-asset uses something that depends on pydantic, even though we're not using stac-asset at all. It just comes along for the ride because stac-task is importing that Config class through the chain of its __init__.py.

The other big thing we import in the __init__.py is the Task class, which has even more imports in it.

If a client only ever wants to use Payload, they're paying the cost of having to initialize a ton of code. Some anecdotal testing showed 1.2 seconds of lambda init time shaved off by deleting all imports from stactask.__init__ (aside from the version ones).
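For anyone wanting to reproduce a number like that outside of lambda, a rough sketch: time the import in a fresh interpreter so already-cached modules don't hide the cost. The helper name import_seconds is made up for illustration, and json is only a stand-in so the snippet runs anywhere; swap in stactask.payload where stac-task is installed.

```python
import subprocess
import sys
import time


def import_seconds(module: str) -> float:
    """Time `import <module>` in a fresh interpreter (includes startup)."""
    start = time.perf_counter()
    subprocess.run([sys.executable, "-c", f"import {module}"], check=True)
    return time.perf_counter() - start


# sys is always loaded, so this approximates bare interpreter startup.
baseline = import_seconds("sys")
# Stand-in module; replace with "stactask.payload" where it is installed.
cost = import_seconds("json")
print(f"startup {baseline:.3f}s, startup plus import {cost:.3f}s")
```

For a per-module breakdown, python -X importtime -c "import stactask.payload" prints the cumulative import time of every module pulled in along the way.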

So what can we do about this?

Well, the easy answer is to do what we tested above: remove the Payload, DownloadConfig, and Task imports from stactask.__init__ and make users import those from their respective modules.

More complex would be to try to use lazy imports. Move imports into functions that need them instead of doing them at the top of a module. Find other ways to put them behind a callable so they don't run at module init, but at time of actual use. In few, if any, cases will all imports actually be used.
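As a minimal sketch of the "move the import into the function" idea: parse_payload is a hypothetical helper, and json is a stdlib stand-in for a genuinely heavy dependency such as stac_asset.

```python
def parse_payload(text: str) -> dict:
    """Hypothetical helper: parse a JSON payload string."""
    # Deferred import: the cost is paid on the first call, not when the
    # enclosing module is imported. Later calls find json in sys.modules.
    import json

    return json.loads(text)


print(parse_payload('{"process": []}'))
```

The trade-off is that a missing or broken dependency only surfaces when the function is first called, not at import time.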

This could trickle down into stac-asset too. There's a lot of conflicting dependencies there--what if you don't even need OAuth2 support? Should some of these deps be extras? Should stac-asset be an extra of stac-task?

Or do we break packages apart even further? The Payload class, for example, could move into a tiny package of its own, and it could be a dependency of both stac-task and cirrus.

Many of these ideas are anti-user-friendly. They require users to understand what options/extras are available and how packages/modules are structured. Others are anti-dev-friendly, and expose things directly to users we might want to obscure more to allow organizational changes without a breaking change. Or increase development overhead and the cost of making changes.

It's hard to see what the right answer is. But I think the clear problem is code executed indiscriminately on init. Are there patterns we can embrace as package maintainers that could allow for easier support for lazy imports? Can we expose our cake without having to bake it until we actually want to eat it, and then eat it too?
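One pattern that might fit here is a module-level __getattr__ (PEP 562) in the package __init__.py: the top-level names stay importable, but each submodule is only loaded the first time its name is requested. The sketch below is an assumption about how that could look, not a proposal for the actual stac-task layout; lazy_pkg and the _LAZY mapping are made up for illustration.

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path

INIT = textwrap.dedent('''
    import importlib

    # Public name -> submodule that defines it (hypothetical layout).
    _LAZY = {"Task": ".task", "Payload": ".payload"}
    __all__ = sorted(_LAZY)

    def __getattr__(name):
        # Called only when normal attribute lookup on the package fails.
        if name in _LAZY:
            module = importlib.import_module(_LAZY[name], __name__)
            value = getattr(module, name)
            globals()[name] = value  # cache so this runs once per name
            return value
        raise AttributeError(f"module {__name__!r} has no attribute {name!r}")

    def __dir__():
        return __all__
''')

# Write the demo package to a temp dir and make it importable.
root = Path(tempfile.mkdtemp())
pkg = root / "lazy_pkg"
pkg.mkdir()
(pkg / "__init__.py").write_text(INIT)
(pkg / "task.py").write_text("class Task:\n    pass\n")
(pkg / "payload.py").write_text("class Payload:\n    pass\n")
sys.path.insert(0, str(root))

lazy_pkg = importlib.import_module("lazy_pkg")
loaded_before = "lazy_pkg.payload" in sys.modules  # nothing loaded yet

Payload = lazy_pkg.Payload  # first access triggers __getattr__("Payload")

print(loaded_before)                       # False
print("lazy_pkg.payload" in sys.modules)   # True
print("lazy_pkg.task" in sys.modules)      # False: Task was never requested
```

With this shape, from lazy_pkg import Payload keeps working and dir(lazy_pkg) still lists the public names. The cost is that static tooling may need extra help (for example a TYPE_CHECKING import block or a stub file) to see those names.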
