
Commit 795225d

[SDP] Create a new dataflow graph
1 parent db1d708 commit 795225d

5 files changed: 93 additions & 17 deletions

docs/declarative-pipelines/PipelinesHandler.md

Lines changed: 15 additions & 5 deletions
@@ -24,11 +24,11 @@ handlePipelinesCommand(
 
 | PipelineCommand | Description | Initiator |
 |-----------------|-------------|-----------|
-| `CREATE_DATAFLOW_GRAPH` | [Creates a new dataflow graph](#CREATE_DATAFLOW_GRAPH) | [pyspark.pipelines.spark_connect_pipeline](#create_dataflow_graph) |
+| `CREATE_DATAFLOW_GRAPH` | [Creates a new dataflow graph](#CREATE_DATAFLOW_GRAPH) | [pyspark.pipelines.spark_connect_pipeline](spark_connect_pipeline.md#create_dataflow_graph) |
 | `DROP_DATAFLOW_GRAPH` | [Drops a pipeline](#DROP_DATAFLOW_GRAPH) ||
 | `DEFINE_DATASET` | [Defines a dataset](#DEFINE_DATASET) | [SparkConnectGraphElementRegistry](SparkConnectGraphElementRegistry.md#register_dataset) |
 | `DEFINE_FLOW` | [Defines a flow](#DEFINE_FLOW) | [SparkConnectGraphElementRegistry](SparkConnectGraphElementRegistry.md#register_flow) |
-| `START_RUN` | [Starts a pipeline run](#START_RUN) | [pyspark.pipelines.spark_connect_pipeline.start_run](#start_run) |
+| `START_RUN` | [Starts a pipeline run](#START_RUN) | [pyspark.pipelines.spark_connect_pipeline.start_run](spark_connect_pipeline.md#start_run) |
 | `DEFINE_SQL_GRAPH_ELEMENTS` | [DEFINE_SQL_GRAPH_ELEMENTS](#DEFINE_SQL_GRAPH_ELEMENTS) | [SparkConnectGraphElementRegistry](SparkConnectGraphElementRegistry.md#register_sql) |
 
 `handlePipelinesCommand` reports an `UnsupportedOperationException` for incorrect commands:
@@ -43,7 +43,7 @@ handlePipelinesCommand(
 
 * `SparkConnectPlanner` ([Spark Connect]({{ book.spark_connect }}/server/SparkConnectPlanner)) is requested to `handlePipelineCommand` (for `PIPELINE_COMMAND` command)
 
-### CREATE_DATAFLOW_GRAPH { #CREATE_DATAFLOW_GRAPH }
+### <span id="CreateDataflowGraph"> CREATE_DATAFLOW_GRAPH { #CREATE_DATAFLOW_GRAPH }
 
 [handlePipelinesCommand](#handlePipelinesCommand) creates a [dataflow graph](#createDataflowGraph) and sends the graph ID back.
 
@@ -113,9 +113,19 @@ createDataflowGraph(
   spark: SparkSession): String
 ```
 
-`createDataflowGraph` finds the catalog and the database in the given `cmd` command and [creates a dataflow graph](DataflowGraphRegistry.md#createDataflowGraph).
+`createDataflowGraph` gets the catalog (from the given `CreateDataflowGraph` if defined in the [pipeline specification file](index.md#pipeline-specification-file)) or prints out the following INFO message to the logs and uses the current catalog instead.
 
-`createDataflowGraph` returns the ID of the created dataflow graph.
+```text
+No default catalog was supplied. Falling back to the current catalog: [currentCatalog].
+```
+
+`createDataflowGraph` gets the database (from the given `CreateDataflowGraph` if defined in the [pipeline specification file](index.md#pipeline-specification-file)) or prints out the following INFO message to the logs and uses the current database instead.
+
+```text
+No default database was supplied. Falling back to the current database: [currentDatabase].
+```
+
+In the end, `createDataflowGraph` [creates a dataflow graph](DataflowGraphRegistry.md#createDataflowGraph) (in the session's [DataflowGraphRegistry](DataflowGraphRegistry.md)).
 
 ## defineSqlGraphElements { #defineSqlGraphElements }
 
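The catalog/database fallback described above can be exercised from the Python client. Below is a hypothetical sketch (not part of this commit): it assumes a Spark Connect session (the `sc://localhost:15002` endpoint is illustrative) and calls `create_dataflow_graph` from the `pyspark.pipelines.spark_connect_pipeline` module with no defaults, which should make the server log the two INFO messages quoted above.

```py
# Hypothetical usage sketch (not from this commit): create a dataflow graph
# without default catalog/database so PipelinesHandler falls back to the
# session's current catalog and database (logging the INFO messages above).
from pyspark.sql import SparkSession
from pyspark.pipelines.spark_connect_pipeline import create_dataflow_graph

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

graph_id = create_dataflow_graph(
    spark,
    default_catalog=None,   # "No default catalog was supplied. ..."
    default_database=None,  # "No default database was supplied. ..."
    sql_conf=None,
)
print(graph_id)  # ID of the dataflow graph registered in the DataflowGraphRegistry
```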
docs/declarative-pipelines/SparkPipelines.md

Lines changed: 4 additions & 2 deletions
@@ -9,9 +9,11 @@ subtitle: Spark Pipelines CLI
 
 `SparkPipelines` is a Scala "launchpad" to execute [pyspark/pipelines/cli.py](#pyspark-pipelines-cli) Python script (through [SparkSubmit]({{ book.spark_core }}/tools/spark-submit/SparkSubmit/)).
 
-## PySpark Pipelines CLI
+## cli.py { #pyspark-pipelines-cli }
 
-`pyspark/pipelines/cli.py` is the Pipelines CLI that is launched using [spark-pipelines](./index.md#spark-pipelines) shell script.
+`pyspark/pipelines/cli.py` is the heart of the Spark Pipelines CLI (launched using the [spark-pipelines](./index.md#spark-pipelines) shell script).
+
+As a Python script, `cli.py` can simply import Python libraries (to trigger their execution), whereas SQL libraries are left untouched and sent over the wire to a Spark Connect server ([PipelinesHandler](PipelinesHandler.md)) for execution.
 
 The Pipelines CLI supports the following commands:
 
docs/declarative-pipelines/configuration-properties.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 # Configuration Properties
 
-**Configuration properties** (aka **settings**) for [Spark Declarative Pipelines](index.md).
+**Configuration properties** (aka **configs**) for [Spark Declarative Pipelines](index.md).
 
 ## <span id="PIPELINES_EVENT_QUEUE_CAPACITY"> event.queue.capacity { #spark.sql.pipelines.event.queue.capacity }
 
docs/declarative-pipelines/index.md

Lines changed: 11 additions & 9 deletions
@@ -46,33 +46,35 @@ Once described, a pipeline can be [started](PipelineExecution.md#runPipeline) (o
 
 The heart of a Declarative Pipelines project is a **pipeline specification file** (in YAML format).
 
-In the pipeline specification file, Declarative Pipelines developers specify files (`libraries`) with tables, views and flows (transformations) definitions in Python and SQL. A SDP project can use both languages simultaneously.
+In the pipeline specification file, Declarative Pipelines developers specify files (`libraries`) with tables, views and flows (transformations) definitions in [Python](#python) and [SQL](#sql). A SDP project can use both languages simultaneously.
 
 The following fields are supported:
 
 Field Name | Description
 -|-
-`name` (required) | |
-`catalog` | |
-`database` | |
-`schema` | Alias of `database`. Used unless `database` is defined |
-`configuration` | |
-`libraries` | `glob`s of `include`s with SQL and Python transformations |
+`name` (required) | &nbsp;
+`catalog` | The default catalog to register datasets into.<br>Unless specified, [PipelinesHandler](PipelinesHandler.md#createDataflowGraph) falls back to the current catalog.
+`database` | The default database to register datasets into.<br>Unless specified, [PipelinesHandler](PipelinesHandler.md#createDataflowGraph) falls back to the current database.
+`schema` | Alias of `database`. Used unless `database` is defined
+`storage` | ⚠️ does not seem to be used
+`configuration` | SparkSession configs<br>Spark Pipelines runtime uses the configs to build a new `SparkSession` when `run`.<br>[spark.sql.connect.serverStacktrace.enabled]({{ book.spark_connect }}/configuration-properties/#spark.sql.connect.serverStacktrace.enabled) is hardcoded to be always `false`.
+`libraries` | `glob`s of `include`s with transformations in [SQL](#sql) and [Python](#python-decorators)
 
 ```yaml
 name: hello-spark-pipelines
 catalog: default_catalog
 schema: default
+storage: storage-root
 configuration:
   spark.key1: value1
 libraries:
   - glob:
       include: transformations/**
 ```
 
-## spark-pipelines Shell Script { #spark-pipelines }
+## Spark Pipelines CLI { #spark-pipelines }
 
-`spark-pipelines` shell script is used to launch [org.apache.spark.deploy.SparkPipelines](SparkPipelines.md).
+`spark-pipelines` shell script is the **Spark Pipelines CLI** (that launches [org.apache.spark.deploy.SparkPipelines](SparkPipelines.md) behind the scenes).
 
 ## Dataset Types
docs/declarative-pipelines/spark_connect_pipeline.md

Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@
+---
+title: spark_connect_pipeline
+---
+
+# spark_connect_pipeline PySpark Module
+
+## create_dataflow_graph { #create_dataflow_graph }
+
+```py
+create_dataflow_graph(
+  spark: SparkSession,
+  default_catalog: Optional[str],
+  default_database: Optional[str],
+  sql_conf: Optional[Mapping[str, str]],
+) -> str
+```
+
+`create_dataflow_graph`...FIXME
+
+---
+
+`create_dataflow_graph` is used when:
+
+* FIXME
+
+## start_run { #start_run }
+
+```py
+start_run(
+  spark: SparkSession,
+  dataflow_graph_id: str,
+  full_refresh: Optional[Sequence[str]],
+  full_refresh_all: bool,
+  refresh: Optional[Sequence[str]],
+  dry: bool,
+  storage: str,
+) -> Iterator[Dict[str, Any]]
+```
+
+`start_run`...FIXME
+
+---
+
+`start_run` is used when:
+
+* FIXME
+
+## handle_pipeline_events { #handle_pipeline_events }
+
+```py
+handle_pipeline_events(
+  iter: Iterator[Dict[str, Any]]
+) -> None
+```
+
+`handle_pipeline_events`...FIXME
+
+---
+
+`handle_pipeline_events` is used when:
+
+* FIXME
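Read together, the three signatures above outline how a pipeline run is driven over Spark Connect. The following is a hypothetical end-to-end sketch (not from this commit); argument values mirror the spec-file example earlier, and the `sc://localhost:15002` endpoint is illustrative.

```py
# Hypothetical wiring of the helpers documented above (values are illustrative).
from pyspark.sql import SparkSession
from pyspark.pipelines.spark_connect_pipeline import (
    create_dataflow_graph,
    handle_pipeline_events,
    start_run,
)

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

# 1. Register a new dataflow graph and keep its ID.
graph_id = create_dataflow_graph(
    spark,
    default_catalog="default_catalog",
    default_database="default",
    sql_conf={"spark.key1": "value1"},
)

# 2. (Datasets and flows would be registered against graph_id here,
#    i.e. the DEFINE_DATASET / DEFINE_FLOW commands from PipelinesHandler.)

# 3. Start the run and stream pipeline events back as they arrive.
events = start_run(
    spark,
    dataflow_graph_id=graph_id,
    full_refresh=None,
    full_refresh_all=False,
    refresh=None,
    dry=False,
    storage="storage-root",
)
handle_pipeline_events(events)
```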
