# docs/declarative-pipelines/PipelinesHandler.md

## handlePipelinesCommand { #handlePipelinesCommand }

| PipelineCommand | Description | Initiator |
|-----------------|-------------|-----------|
| `CREATE_DATAFLOW_GRAPH` | [Creates a new dataflow graph](#CREATE_DATAFLOW_GRAPH) | [pyspark.pipelines.spark_connect_pipeline](spark_connect_pipeline.md#create_dataflow_graph) |
| `DROP_DATAFLOW_GRAPH` | [Drops a pipeline](#DROP_DATAFLOW_GRAPH) | |
| `DEFINE_DATASET` | [Defines a dataset](#DEFINE_DATASET) | [SparkConnectGraphElementRegistry](SparkConnectGraphElementRegistry.md#register_dataset) |
| `DEFINE_FLOW` | [Defines a flow](#DEFINE_FLOW) | [SparkConnectGraphElementRegistry](SparkConnectGraphElementRegistry.md#register_flow) |
| `START_RUN` | [Starts a pipeline run](#START_RUN) | [pyspark.pipelines.spark_connect_pipeline.start_run](spark_connect_pipeline.md#start_run) |

### CREATE_DATAFLOW_GRAPH { #CREATE_DATAFLOW_GRAPH }

[handlePipelinesCommand](#handlePipelinesCommand) creates a [dataflow graph](#createDataflowGraph) and sends the graph ID back.
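
Below is a hedged sketch of the initiator's side of `CREATE_DATAFLOW_GRAPH`. It assumes the Spark Connect protobuf bindings bundled with PySpark; the field names follow the prose above, but treat them as illustrative rather than a verbatim copy of `pyspark.pipelines.spark_connect_pipeline`.

```python
# Hedged sketch: issuing CREATE_DATAFLOW_GRAPH from a Spark Connect client.
# Assumption: the pipelines command lives under Command.pipeline_command and
# exposes default_catalog/default_database, as described above.
import pyspark.sql.connect.proto as pb2

command = pb2.Command()
graph = command.pipeline_command.create_dataflow_graph
graph.default_catalog = "my_catalog"   # optional; server falls back to the current catalog
graph.default_database = "my_db"       # optional; server falls back to the current database

# Executing the command over the session's Spark Connect client returns a
# result that carries the ID of the new dataflow graph; subsequent commands
# (DEFINE_DATASET, DEFINE_FLOW, START_RUN) reference that ID.
```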

## createDataflowGraph { #createDataflowGraph }

```scala
createDataflowGraph(
  cmd: proto.PipelineCommand.CreateDataflowGraph,
  spark: SparkSession): String
```

`createDataflowGraph` gets the catalog (from the given `CreateDataflowGraph` command, if defined in the [pipeline specification file](index.md#pipeline-specification-file)) or prints out the following INFO message to the logs and uses the current catalog instead.

```text
No default catalog was supplied. Falling back to the current catalog: [currentCatalog].
```

`createDataflowGraph` gets the database (from the given `CreateDataflowGraph` command, if defined in the [pipeline specification file](index.md#pipeline-specification-file)) or prints out the following INFO message to the logs and uses the current database instead.

```text
No default database was supplied. Falling back to the current database: [currentDatabase].
```

In the end, `createDataflowGraph` [creates a dataflow graph](DataflowGraphRegistry.md#createDataflowGraph) (in the session's [DataflowGraphRegistry](DataflowGraphRegistry.md)).
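
The catalog/database fallback boils down to one line per setting. Here is a minimal Python rendering of the behaviour described above (the real implementation is Scala code in `PipelinesHandler`; this is illustrative only):

```python
# Illustrative only: the default-catalog/database fallback described above.
def resolve_defaults(cmd, spark):
    # Prefer the values from the CreateDataflowGraph command, falling back
    # to the session's current catalog and database when they are not set.
    catalog = cmd.default_catalog or spark.catalog.currentCatalog()
    database = cmd.default_database or spark.catalog.currentDatabase()
    return catalog, database
```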

# docs/declarative-pipelines/SparkPipelines.md

`SparkPipelines` is a Scala "launchpad" to execute the [pyspark/pipelines/cli.py](#pyspark-pipelines-cli) Python script (through [SparkSubmit]({{ book.spark_core }}/tools/spark-submit/SparkSubmit/)).

## cli.py { #pyspark-pipelines-cli }

`pyspark/pipelines/cli.py` is the heart of the Spark Pipelines CLI (launched using the [spark-pipelines](./index.md#spark-pipelines) shell script).

As a Python script, `cli.py` can simply import Python libraries (to trigger their execution), whereas SQL libraries are left untouched and sent over the wire to a Spark Connect server ([PipelinesHandler](PipelinesHandler.md)) for execution.
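
A minimal sketch of that split follows; the helper and registry names are hypothetical, not the real `cli.py` internals.

```python
# Hedged sketch of the Python-vs-SQL split described above.
# load_library and registry.register_sql_file are hypothetical names.
import importlib.util
from pathlib import Path

def load_library(path: Path, registry) -> None:
    if path.suffix == ".py":
        # Importing a Python library executes it, so the dataset/flow
        # decorators inside register themselves with the local registry.
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
    elif path.suffix == ".sql":
        # SQL libraries are not parsed locally: the raw text is shipped to
        # the Spark Connect server (PipelinesHandler) for execution.
        registry.register_sql_file(path.read_text(), path)
```
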
The Pipelines CLI supports the following commands:

# docs/declarative-pipelines/index.md

## Pipeline Specification File { #pipeline-specification-file }

The heart of a Declarative Pipelines project is a **pipeline specification file** (in YAML format).

In the pipeline specification file, Declarative Pipelines developers specify files (`libraries`) with the definitions of tables, views and flows (transformations) in [Python](#python) and [SQL](#sql). An SDP project can use both languages simultaneously.

The following fields are supported:

Field Name | Description
-|-
`name` (required) | The name of the pipeline
`catalog` | The default catalog to register datasets into.<br>Unless specified, [PipelinesHandler](PipelinesHandler.md#createDataflowGraph) falls back to the current catalog.
`database` | The default database to register datasets into.<br>Unless specified, [PipelinesHandler](PipelinesHandler.md#createDataflowGraph) falls back to the current database.
`schema` | Alias of `database`. Used unless `database` is defined.
`storage` | ⚠️ Does not seem to be used.
`configuration` | `SparkSession` configs.<br>The Spark Pipelines runtime uses them to build a new `SparkSession` at `run` time.<br>[spark.sql.connect.serverStacktrace.enabled]({{ book.spark_connect }}/configuration-properties/#spark.sql.connect.serverStacktrace.enabled) is hardcoded to always be `false`.
`libraries` | `glob`s of `include`s with transformations in [SQL](#sql) and [Python](#python-decorators).<br>A sample specification file follows this table.
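
As promised above, here is a sample specification file. It is a hedged example: the pipeline name and paths are made up, and the field values follow the table above.

```yaml
# pipeline.yml -- sample pipeline specification file (illustrative values)
name: hello_pipeline
catalog: my_catalog          # optional; falls back to the current catalog
database: my_database        # optional; falls back to the current database
configuration:
  spark.sql.shuffle.partitions: "10"
libraries:
  - glob:
      include: transformations/**
```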

## spark-pipelines { #spark-pipelines }

The `spark-pipelines` shell script is the **Spark Pipelines CLI** (it launches [org.apache.spark.deploy.SparkPipelines](SparkPipelines.md) behind the scenes).
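
A typical session with the CLI (the `init` and `run` subcommands come from the Spark Declarative Pipelines documentation; output omitted):

```text
$ spark-pipelines init --name hello_pipeline
$ cd hello_pipeline
$ spark-pipelines run
```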