- Clone the repository

  ```bash
  git clone https://github.com/ServiceNow/StarFlow.git
  cd StarFlow
  ```
- Edit `~/.secret` (create a new file if it does not exist)

  ```bash
  export HF_TOKEN=<HF_TOKEN>
  export WANDB_API_KEY=<WANDB_API_KEY>
  export OPENROUTER_API_KEY=<OPENROUTER_API_KEY>
  export OUTPUT_DIR=<OUTPUT_DIR>
  ...
  ```
- Edit `~/.bashrc` (create a new file if it does not exist)

  ```bash
  source ~/.secret
  ...
  ```
- Install packages

  ```bash
  # for Llama, Qwen, Pixtral, and API models
  bash installer/default/install.sh
  # for the Phi-3.5 model
  bash installer/phi35/install.sh
  # for the Phi-4 model
  bash installer/phi4/install.sh
  # for DeepSeek models
  bash installer/deepseek/install.sh
  ```
- Training

  ```bash
  torchrun \
      --nproc-per-node 2 \
      starflow/pipeline/train.py \
      dataset_config_file=starflow/config/dataset/bigdocs_sketch2flow.yaml \
      model_config_file=starflow/config/model/llama_32_11b.yaml \
      pipeline_config_file=starflow/config/pipeline/train.yaml
  ```
- Evaluation

  ```bash
  torchrun \
      --nproc-per-node 2 \
      starflow/pipeline/evaluate.py \
      dataset_config_file=starflow/config/dataset/bigdocs_sketch2flow.yaml \
      model_config_file=starflow/config/model/llama_32_11b.yaml \
      pipeline_config_file=starflow/config/pipeline/evaluate.yaml
  ```
- Evaluation for very large models (e.g. Llama-3.2-90B-Vision-Instruct)

  ```bash
  python \
      starflow/pipeline/evaluate.py \
      dataset_config_file=starflow/config/dataset/bigdocs_sketch2flow.yaml \
      model_config_file=starflow/config/model/llama_32_90b.yaml \
      pipeline_config_file=starflow/config/pipeline/evaluate.yaml
  ```
- Evaluation for API models (e.g. GPT-4o)

  ```bash
  python \
      starflow/pipeline/evaluate_api.py \
      dataset_config_file=starflow/config/dataset/bigdocs_sketch2flow.yaml \
      model_config_file=starflow/config/model/gpt_4o.yaml \
      pipeline_config_file=starflow/config/pipeline/evaluate_api.yaml
  ```
- Other models can be trained and evaluated by setting their config file path as the value of `model_config_file`.
- The values in the involved config files should be set properly before running training and evaluation (see the preflight sketch after this list).
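As a quick preflight before launching a run, the environment variables from `~/.secret` and a config file passed on the command line can be checked. This standalone sketch is illustrative and not part of StarFlow; it only prints what the YAML contains rather than assuming a particular schema:

```python
import os
import yaml  # pip install pyyaml

# Names mirror ~/.secret above; the check itself is ours, not StarFlow's.
required = ["HF_TOKEN", "WANDB_API_KEY", "OPENROUTER_API_KEY", "OUTPUT_DIR"]
missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise SystemExit(f"missing environment variables: {', '.join(missing)}")

# Print a config file referenced on the command line so misconfigured values
# surface before a long run starts; no assumption is made about its keys.
with open("starflow/config/dataset/bigdocs_sketch2flow.yaml") as f:
    for key, value in yaml.safe_load(f).items():
        print(f"{key}: {value}")
```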
StarFlow consists of four types of components: datasets, metrics, models, and pipelines.
Datasets provide vision-language data for training and evaluation. They are encapsulated as sub-classes of `VLDataset`. For example, the BigDocs datasets are encapsulated as `BigDocsDataset`.

When instantiating a dataset, its data examples are first loaded from either Hugging Face or local storage, and then encapsulated as `VLExample`.

Each dataset comes with a config file, which specifies the settings for instantiating and using the dataset. For example, the config file for `ServiceNow/BigDocs-Sketch2Flow` is `starflow/config/dataset/bigdocs_sketch2flow.yaml`.
Metrics compute performance numbers of models on datasets. They are encapsulated as sub-classes of `VLMetric`. For example, the Flow Similarity metric is encapsulated as `FlowSimilarityMetric`.

When using a metric to evaluate a model on a dataset, the metric compares the outputs of the model with the corresponding ground truths in the dataset and thereby obtains the performance numbers.

Each metric is applied to one or more datasets, and the settings for instantiating and using the metric are specified in the config files of the target datasets. For example, the settings for `FlowSimilarityMetric` are specified in the config file of `ServiceNow/BigDocs-Sketch2Flow` (`starflow/config/dataset/bigdocs_sketch2flow.yaml`).
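A hedged sketch of that metric shape follows; the real `FlowSimilarityMetric` scores structural similarity between flows, so the exact-match scoring below is only a placeholder for the output-vs-ground-truth comparison:

```python
class VLMetric:
    """Base class: compares model outputs with ground truths (illustrative)."""
    def compute(self, outputs: list[str], targets: list[str]) -> dict[str, float]:
        raise NotImplementedError

class FlowSimilarityMetric(VLMetric):
    def compute(self, outputs, targets):
        # Placeholder: a real implementation would parse each output and
        # target into a flow graph and score their structural similarity.
        scores = [1.0 if out == tgt else 0.0 for out, tgt in zip(outputs, targets)]
        return {"flow_similarity": sum(scores) / max(len(scores), 1)}
```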
Models generate textual outputs given vision-language inputs from datasets. They are encapsulated as sub-classes of `VLModel`, and their inputs are encapsulated as sub-classes of `VLInput`. For example, the Llama-3.2-Vision-Instruct models are encapsulated as `LlamaModel`, and their inputs are encapsulated as `LlamaInput`.

When training a model, a cross-entropy loss is obtained from the forward pass of the model, which is then optimized in the backward pass through gradient descent. When evaluating a model, the textual outputs of the model are processed by the applied metrics to compute performance numbers.

Each model comes with a config file, which specifies the settings for instantiating and using the model. For example, the config file for Llama-3.2-11B-Vision-Instruct is `starflow/config/model/llama_32_11b.yaml`.
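The training step described above amounts to the usual forward/backward loop. A minimal PyTorch sketch, assuming `model` returns an object with a `.loss` field (as Hugging Face models do) and `batch` is a dict of tensors built from the `VLInput`:

```python
import torch

def train_step(model, batch, optimizer):
    optimizer.zero_grad()
    outputs = model(**batch)  # forward pass yields a cross-entropy loss
    loss = outputs.loss
    loss.backward()           # backward pass computes gradients
    optimizer.step()          # gradient descent update
    return loss.item()
```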
A special category of models is API models, which can only be used through API calls. They are encapsulated as sub-classes of `VLAPIModel`, and each of them comes with a config file. For example, the OpenRouter-routed GPT-4o model is encapsulated as `OpenRouterAPIModel`, and its config file is `starflow/config/model/gpt_4o.yaml`. API models cannot be trained, but can still be evaluated.
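For illustration, querying an OpenRouter-routed model looks like the following; this mirrors what a `VLAPIModel` sub-class might do internally, though the wrapper details here are assumptions. The `OPENROUTER_API_KEY` comes from `~/.secret` above:

```python
import os
from openai import OpenAI  # pip install openai

# OpenRouter exposes an OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)
response = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Describe this workflow sketch."}],
)
print(response.choices[0].message.content)
```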
Pipelines are Python scripts that execute complete processes with datasets, metrics, and models. There are three pipelines, each of which comes with a config file:

- Training pipeline: the pipeline for training a model on a dataset. It is implemented as `starflow/pipeline/train.py`, and its config file is `starflow/config/pipeline/train.yaml`.
- Evaluation pipeline: the pipeline for evaluating a model on a dataset with the applied metrics. It is implemented as `starflow/pipeline/evaluate.py`, and its config file is `starflow/config/pipeline/evaluate.yaml`.
- API model evaluation pipeline: the pipeline for evaluating an API model on a dataset with the applied metrics. It is implemented as `starflow/pipeline/evaluate_api.py`, and its config file is `starflow/config/pipeline/evaluate_api.yaml`.
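Note that the pipeline commands above pass configuration as bare `key=value` arguments rather than `--flags`. A minimal sketch of how such arguments can be parsed (our illustration, not necessarily StarFlow's implementation):

```python
import sys

def parse_overrides(argv: list[str]) -> dict[str, str]:
    """Turn ['a=b', 'c=d'] into {'a': 'b', 'c': 'd'}."""
    overrides = {}
    for arg in argv:
        key, _, value = arg.partition("=")
        overrides[key] = value
    return overrides

if __name__ == "__main__":
    args = parse_overrides(sys.argv[1:])
    print(args.get("dataset_config_file"))
    print(args.get("model_config_file"))
    print(args.get("pipeline_config_file"))
```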
```bibtex
@article{bechard2025starflow,
  title={StarFlow: Generating Structured Workflow Outputs From Sketch Images},
  author={Bechard, Patrice and Wang, Chao and Abaskohi, Amirhossein and Rodriguez, Juan and Pal, Christopher and Vazquez, David and Gella, Spandana and Rajeswar, Sai and Taslakian, Perouz},
  journal={arXiv preprint arXiv:2503.21889},
  year={2025}
}
```