use a dummy operator at the start of parallel pipelines #2197
base: devel
Conversation
✅ Deploy Preview for dlt-hub-docs canceled.
@alucryd there's a reason to run the first task and then all others in parallel: it will create the initial schema in the database and the standard dlt tables. All tasks share the same dataset.
If you still want to work on this PR then let's add a new option to add_run, i.e. dummy_task_first, and if set to True, do what you do right now.
I do not want to change existing behavior; too many deployments that may rely on it are in production.
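The two DAG shapes being discussed can be sketched in plain Python (no Airflow dependency). dummy_task_first is the option name proposed in this thread; the helper below is purely illustrative and is not dlt's actual add_run signature, and the task names are made up.

```python
def wire_tasks(tasks, dummy_task_first=False):
    """Return a {task: [upstream, ...]} dependency map for the given source tasks.

    Illustrative sketch only; dummy_task_first is the flag proposed in this
    thread, not an existing dlt parameter.
    """
    if dummy_task_first:
        # Proposed: a no-op start task fans out so every source runs in parallel.
        return {t: ["dummy_start"] for t in tasks}
    # Current behavior: the first task runs alone (creating the initial schema
    # and standard dlt tables), then the remaining tasks run in parallel.
    first, rest = tasks[0], tasks[1:]
    return {first: [], **{t: [first] for t in rest}}

current = wire_tasks(["load_a", "load_b", "load_c"])
proposed = wire_tasks(["load_a", "load_b", "load_c"], dummy_task_first=True)
```

With the flag defaulting to False, existing deployments keep the current sequential-first wiring untouched.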
I see, thanks for the heads up. I don't have the full picture yet, but I'm getting there. I ran this change in production with a completely new data source and didn't run into any issues, so I wrongly assumed it would be harmless. I assume it would be too much work to split the schema and table creation and only run that in the first task? In any case, I'll add the proposed option and default it to False so it doesn't impact anyone.
@alucryd yeah, we could think of some "preparatory" task, but IMO in that case it is better to just create a callback that receives a DAG from the airflow helper and can modify it... we already have that idea, but that's a separate ticket I'd say - if you'd like to try to add it
@alucryd do you plan to continue on this?
Description
Used a DummyOperator instead of the first source to parallelize all sources, including the first one.
Related Issues
Additional Context
The first source can take a long time to run; this change can make pipelines faster by parallelizing even the first source.
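To make the speedup concrete, a rough back-of-the-envelope sketch (the runtimes below are made-up numbers, not measurements): with the current wiring, total wall time is the first source's runtime plus the longest of the rest, while with a dummy start task it is just the longest single source.

```python
def makespan_sequential_first(durations):
    # Current wiring: the first task runs alone, then the rest run in parallel.
    return durations[0] + (max(durations[1:]) if len(durations) > 1 else 0)

def makespan_all_parallel(durations):
    # Dummy-first wiring: every source runs concurrently from the start.
    return max(durations)

runtimes = [30, 5, 5]  # hypothetical per-source runtimes in minutes
print(makespan_sequential_first(runtimes))  # 35
print(makespan_all_parallel(runtimes))      # 30
```

The gain is largest exactly when the first source is the slow one, which is the scenario motivating this PR.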