Support Flux Kontext in modular (T2I and I2I) #12269
base: main
Conversation
Cc: @asomoza if you wanna test.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
thanks! @sayakpaul
I left some thoughts/comments. I think the best approach is to wait a couple of days for our Mellon nodes to be merged (they are almost ready) - that way we can play with it and test out how it works with Flux.
This way you can get first-hand experience with the node-based workflow and how we best structure our blocks to support that use case. The plan is for all our modular pipelines to work out of the box with the same set of Mellon nodes :)
    block_state.batch_size * block_state.num_images_per_prompt, seq_len, -1
)
# TODO: `_auto_resize` is currently forced to True. Since it's private anyway, I thought of not adding it.
block_state.image = self.preprocess_image(
I think image processing should be part of the VAE encoder, because in a node-based modular workflow all of the "encoding"/"decoding" needs to run as a separate process
(similar to how you could generate text embeddings, delete the text encoders, and then run inference directly from the embeddings -> the same process can be applied to images).
So we arrange our blocks to support the node-based workflow - I usually put all the encode blocks first (text encoder / VAE encoder / image encoder in the IP-Adapter case, ...), so that these blocks can be popped out into separate processes if needed,
and then I put an input step after that to standardize all the encoder outputs for the actual denoise step, which usually includes set_timesteps, prepare_latents, and the denoising loop. A rough sketch of this ordering is below.
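To make the suggested ordering concrete, here is a minimal sketch in plain Python. It is not the actual modular-diffusers block API; all class and function names here are illustrative, and the point is only the order in which the blocks run and how their outputs are shared.

```python
# Illustrative sketch of the block ordering: encoders first, then an input
# step that standardizes encoder outputs, then the denoise-related blocks.
from dataclasses import dataclass, field


@dataclass
class BlockState:
    """Holds intermediate outputs shared between blocks."""
    data: dict = field(default_factory=dict)


def text_encoder_block(state: BlockState) -> None:
    # Produces prompt embeddings; can run as its own node/process.
    state.data["prompt_embeds"] = "<prompt embeddings>"


def vae_encoder_block(state: BlockState) -> None:
    # Produces image latents (I2I); also poppable into its own node.
    state.data["image_latents"] = "<image latents>"


def input_block(state: BlockState) -> None:
    # Standardizes encoder outputs (batching, dtype/device, etc.) for denoising.
    state.data["model_inputs"] = {
        k: state.data[k] for k in ("prompt_embeds", "image_latents")
    }


def denoise_block(state: BlockState) -> None:
    # set_timesteps -> prepare_latents -> denoising loop would live here.
    state.data["latents"] = f"denoise({state.data['model_inputs']})"


# Encoders come first so a node-based runner can execute them separately,
# cache their outputs, and drop the encoder weights before denoising.
PIPELINE_BLOCKS = [text_encoder_block, vae_encoder_block, input_block, denoise_block]

state = BlockState()
for block in PIPELINE_BLOCKS:
    block(state)
```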
)

@staticmethod
def prepare_latents(
We should move the encoding out of the prepare_latents step; that way we can pop the VAE encoder out as its own node and generate the image latents just once, since this part is usually the same across a workflow.
Then users can do different things with the image latents as they like: change steps/scheduler/seeds/batch_size/prompts.
This will also work nicely with hybrid inference once we move the VAE/text encoders remote. A rough sketch of this split is below.
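For illustration, a hedged sketch of what that split could look like. The function names and the VAE scaling details are illustrative, not the PR's actual blocks; the point is that the VAE call happens once, and prepare_latents only reuses the cached latents.

```python
# Sketch: VAE encode runs once, outside prepare_latents, so the cached image
# latents can be reused across runs with different seeds/schedulers/steps.
import torch


def encode_image(vae, image: torch.Tensor) -> torch.Tensor:
    # Runs once per input image; the result can be cached or computed remotely.
    with torch.no_grad():
        latent_dist = vae.encode(image).latent_dist
        return (latent_dist.sample() - vae.config.shift_factor) * vae.config.scaling_factor


def prepare_latents(
    image_latents: torch.Tensor, batch_size: int, generator: torch.Generator
):
    # No VAE call here: only duplicate the cached latents and draw fresh noise.
    image_latents = image_latents.repeat(batch_size, 1, 1, 1)
    noise = torch.randn(image_latents.shape, generator=generator)
    return noise, image_latents


# Usage (illustrative):
# image_latents = encode_image(vae, image)        # encode once
# for seed in (0, 1, 2):                          # many denoise runs, no re-encode
#     gen = torch.Generator().manual_seed(seed)
#     noise, latents = prepare_latents(image_latents, batch_size=1, generator=gen)
```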
@yiyixuxu okay! Could you give this PR a ping once you think I can work out the changes you mentioned, or would you rather I addressed the feedback before?
What does this PR do?
Code to test
I have compared this with the regular Kontext pipeline outputs and my eyes didn't catch any differences.
Code
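For reference, a hedged sketch of the kind of comparison described above, using the existing FluxKontextPipeline as the baseline. The modular side is left as a placeholder since its exact API is what this PR defines; the checkpoint id and image URL follow the usual diffusers docs examples and should be treated as assumptions.

```python
# Sketch: generate a reference output with the regular Kontext pipeline,
# then diff it against the output of the modular Kontext blocks.
import torch
from diffusers import FluxKontextPipeline
from diffusers.utils import load_image

prompt = "Make the cat wear a tiny wizard hat"
image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png"
)

pipe = FluxKontextPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Kontext-dev", torch_dtype=torch.bfloat16
).to("cuda")

reference = pipe(
    prompt=prompt,
    image=image,
    guidance_scale=2.5,
    generator=torch.Generator("cuda").manual_seed(0),
    output_type="np",
).images[0]

# modular_output = ...  # produced by the modular Kontext blocks from this PR
# print(abs(reference - modular_output).max())  # expect a near-zero difference
```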