Skip to content
Open
Show file tree
Hide file tree
Changes from 7 commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 8 additions & 0 deletions docs/configs/docsdev.js
Original file line number Diff line number Diff line change
Expand Up @@ -153,6 +153,10 @@ export default {
title: 'Amazon EMR',
link: '/en-us/docs/dev/user_doc/guide/task/emr.html',
},
{
title: 'Amazon EMR Serverless',
link: '/en-us/docs/dev/user_doc/guide/task/emr-serverless.html',
},
{
title: 'Apache Zeppelin',
link: '/en-us/docs/dev/user_doc/guide/task/zeppelin.html',
Expand Down Expand Up @@ -881,6 +885,10 @@ export default {
title: 'Amazon EMR',
link: '/zh-cn/docs/dev/user_doc/guide/task/emr.html',
},
{
title: 'Amazon EMR Serverless',
link: '/zh-cn/docs/dev/user_doc/guide/task/emr-serverless.html',
},
{
title: 'Apache Zeppelin',
link: '/zh-cn/docs/dev/user_doc/guide/task/zeppelin.html',
Expand Down
144 changes: 144 additions & 0 deletions docs/docs/en/guide/task/emr-serverless.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,144 @@
# Amazon EMR Serverless

## Overview

Amazon EMR Serverless task type, for submitting and monitoring job runs on [Amazon EMR Serverless](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless.html) applications.
Unlike traditional EMR on EC2, EMR Serverless requires no cluster infrastructure management and automatically scales compute resources on demand, suitable for Spark and Hive workloads.

Using [aws-java-sdk](https://aws.amazon.com/cn/sdk-for-java/) in the background code, to transfer JSON parameters to a [StartJobRunRequest](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/emrserverless/model/StartJobRunRequest.html) object
and submit it to AWS via the [StartJobRun API](https://docs.aws.amazon.com/emr-serverless/latest/APIReference/API_StartJobRun.html), then poll job status via the [GetJobRun API](https://docs.aws.amazon.com/emr-serverless/latest/APIReference/API_GetJobRun.html) until completion.

## Create Task

- Click `Project Management -> Project Name -> Workflow Definition`, click the `Create Workflow` button to enter the DAG editing page.
- Drag `AmazonEMRServerless` task from the toolbar to the artboard to complete the creation.


## Task Parameters

[//]: # (TODO: use the commented anchor below once our website template supports this syntax)
[//]: # (- Please refer to [DolphinScheduler Task Parameters Appendix](appendix.md#default-task-parameters) `Default Task Parameters` section for default parameters.)

- Please refer to [DolphinScheduler Task Parameters Appendix](appendix.md) `Default Task Parameters` section for default parameters.

| **Parameter** | **Description** |
|-------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Application Id | EMR Serverless application ID (e.g. `00fkht2eodujab09`), obtainable from the [EMR Serverless Console](https://console.aws.amazon.com/emr/home#/serverless) |
| Execution Role Arn | ARN of the IAM role for job execution (e.g. `arn:aws:iam::123456789012:role/EMRServerlessRole`), this role needs permissions to access S3, Glue, and other services |
| Job Name | Job name (optional), used to identify the job in the EMR Serverless console |
| StartJobRunRequest JSON | JSON corresponding to the `JobDriver` and `ConfigurationOverrides` portions of the [StartJobRunRequest](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/emrserverless/model/StartJobRunRequest.html), see examples below. **Note**: `ApplicationId` and `ExecutionRoleArn` do not need to be included in the JSON as they are automatically injected from the form parameters above |

![RUN_JOB_FLOW](../../../../img/tasks/demo/emr_serverless_create.png)

## Task Example

### Submit a Spark Job

This example shows how to create an `EMR_SERVERLESS` task node to submit a Spark job to an EMR Serverless application.

StartJobRunRequest JSON example (Spark):

```json
{
"JobDriver": {
"SparkSubmit": {
"EntryPoint": "s3://my-bucket/scripts/my-spark-job.jar",
"EntryPointArguments": [
"s3://my-bucket/input/",
"s3://my-bucket/output/"
],
"SparkSubmitParameters": "--class com.example.MySparkApp --conf spark.executor.cores=4 --conf spark.executor.memory=8g --conf spark.executor.instances=10"
}
},
"ConfigurationOverrides": {
"MonitoringConfiguration": {
"S3MonitoringConfiguration": {
"LogUri": "s3://my-bucket/emr-serverless-logs/"
}
}
}
}
```

### Submit a Hive Job

This example shows how to create an `EMR_SERVERLESS` task node to submit a Hive query job.

StartJobRunRequest JSON example (Hive):

```json
{
"JobDriver": {
"HiveSQL": {
"Query": "s3://my-bucket/scripts/my-hive-query.sql",
"Parameters": "--hiveconf hive.exec.dynamic.partition=true --hiveconf hive.exec.dynamic.partition.mode=nonstrict"
}
},
"ConfigurationOverrides": {
"MonitoringConfiguration": {
"S3MonitoringConfiguration": {
"LogUri": "s3://my-bucket/emr-serverless-logs/"
}
},
"ApplicationConfiguration": [
{
"Classification": "hive-site",
"Properties": {
"hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
}
}
]
}
}
```

## AWS Authentication Configuration

The EMR Serverless task reads AWS credentials from the DolphinScheduler `aws.yaml` configuration file, under the `aws.emr` section at `conf/aws.yaml`.

### Using IAM Role (Recommended)

If the DolphinScheduler Worker node runs on an EC2 instance with an attached IAM Role:

```yaml
aws:
emr:
credentials.provider.type: InstanceProfileCredentialsProvider
region: us-east-1
```

### Using Access Key

If you need to authenticate using AK/SK:

```yaml
aws:
emr:
credentials.provider.type: AWSStaticCredentialsProvider
access.key.id: your-access-key-id
access.key.secret: your-secret-access-key
region: us-east-1
```

> **Note**: The `aws.emr` section configuration is shared by both EMR on EC2 and EMR Serverless task types.

## Job State Transitions

After an EMR Serverless job is submitted, DolphinScheduler polls the job status every 10 seconds:

```
SUBMITTED → PENDING → SCHEDULED → RUNNING → SUCCESS
→ FAILED
→ CANCELLED
```

- When a job reaches `SUCCESS` state, the task is marked as successful
- When a job reaches `FAILED` or `CANCELLED` state, the task is marked as failed
- If a DolphinScheduler task is killed, it automatically calls the [CancelJobRun API](https://docs.aws.amazon.com/emr-serverless/latest/APIReference/API_CancelJobRun.html) to cancel the running job

## Notice

- The **Application Id** must correspond to a pre-existing EMR Serverless application (created via the AWS Console or API) in `STARTED` or `CREATED` state
- The **Execution Role** requires the following minimum permissions: `emr-serverless:StartJobRun`, `emr-serverless:GetJobRun`, `emr-serverless:CancelJobRun`, plus S3, Glue and other data access permissions required by the job
- `StartJobRunRequest JSON` should NOT include `ApplicationId` or `ExecutionRoleArn` fields — they are automatically injected from the form parameters
- EMR Serverless task supports failover: when a Worker node fails, a new Worker can recover tracking of running jobs through `appIds` (the `jobRunId`)
147 changes: 147 additions & 0 deletions docs/docs/zh/guide/task/emr-serverless.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,147 @@
# Amazon EMR Serverless

## 综述

Amazon EMR Serverless 任务类型,用于向 [Amazon EMR Serverless](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless.html) 应用程序提交并监控作业运行。
与传统的 EMR on EC2 不同,EMR Serverless 无需管理集群基础设施,按需自动扩缩容计算资源,适用于 Spark 和 Hive 工作负载。

后台使用 [aws-java-sdk](https://aws.amazon.com/cn/sdk-for-java/) 将 JSON 参数转换为 [StartJobRunRequest](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/emrserverless/model/StartJobRunRequest.html) 对象,
通过 [StartJobRun API](https://docs.aws.amazon.com/emr-serverless/latest/APIReference/API_StartJobRun.html) 提交到 AWS,并通过 [GetJobRun API](https://docs.aws.amazon.com/emr-serverless/latest/APIReference/API_GetJobRun.html) 轮询作业状态直到完成。

## 创建任务

- 点击 `项目管理 -> 项目名称 -> 工作流定义`,点击 `创建工作流` 按钮进入 DAG 编辑页面。
- 从工具栏中拖拽 `AmazonEMRServerless` 任务到画布中完成创建。



## 任务参数

[//]: # (TODO: use the commented anchor below once our website template supports this syntax)
[//]: # (- 默认参数说明请参考[DolphinScheduler任务参数附录](appendix.md#默认任务参数)`默认任务参数`一栏。)

- 默认参数说明请参考[DolphinScheduler任务参数附录](appendix.md)`默认任务参数`一栏。

| **任务参数** | **描述** |
|--------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Application Id | EMR Serverless 应用程序 ID(格式如 `00fkht2eodujab09`),可在 [EMR Serverless 控制台](https://console.aws.amazon.com/emr/home#/serverless) 获取 |
| Execution Role Arn | 作业执行 IAM 角色的 ARN(格式如 `arn:aws:iam::123456789012:role/EMRServerlessRole`),该角色需要有访问 S3、Glue 等服务的权限 |
| Job Name | 作业名称(可选),用于在 EMR Serverless 控制台中标识作业 |
| StartJobRunRequest JSON | [StartJobRunRequest](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/emrserverless/model/StartJobRunRequest.html) 中 `JobDriver` 和 `ConfigurationOverrides` 部分对应的 JSON,详细定义见下方示例。**注意**:`ApplicationId` 和 `ExecutionRoleArn` 无需在 JSON 中重复填写,系统会自动从上方参数注入 |

![RUN_JOB_FLOW](../../../../img/tasks/demo/emr_serverless_create.png)


## 任务样例

### 提交 Spark 作业

该样例展示了如何创建 `EMR_SERVERLESS` 任务节点来提交一个 Spark 作业到 EMR Serverless 应用程序。


StartJobRunRequest JSON 参数样例(Spark):

```json
{
"JobDriver": {
"SparkSubmit": {
"EntryPoint": "s3://my-bucket/scripts/my-spark-job.jar",
"EntryPointArguments": [
"s3://my-bucket/input/",
"s3://my-bucket/output/"
],
"SparkSubmitParameters": "--class com.example.MySparkApp --conf spark.executor.cores=4 --conf spark.executor.memory=8g --conf spark.executor.instances=10"
}
},
"ConfigurationOverrides": {
"MonitoringConfiguration": {
"S3MonitoringConfiguration": {
"LogUri": "s3://my-bucket/emr-serverless-logs/"
}
}
}
}
```

### 提交 Hive 作业

该样例展示了如何创建 `EMR_SERVERLESS` 任务节点来提交一个 Hive 查询作业。

StartJobRunRequest JSON 参数样例(Hive):

```json
{
"JobDriver": {
"HiveSQL": {
"Query": "s3://my-bucket/scripts/my-hive-query.sql",
"Parameters": "--hiveconf hive.exec.dynamic.partition=true --hiveconf hive.exec.dynamic.partition.mode=nonstrict"
}
},
"ConfigurationOverrides": {
"MonitoringConfiguration": {
"S3MonitoringConfiguration": {
"LogUri": "s3://my-bucket/emr-serverless-logs/"
}
},
"ApplicationConfiguration": [
{
"Classification": "hive-site",
"Properties": {
"hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
}
}
]
}
}
```

## AWS 认证配置

EMR Serverless 任务通过 DolphinScheduler 的 `aws.yaml` 配置文件读取 AWS 认证信息,配置路径为 `conf/aws.yaml` 中的 `aws.emr` 段。

### 使用 IAM Role(推荐)

如果 DolphinScheduler Worker 节点运行在 EC2 实例上并已绑定 IAM Role,配置如下:

```yaml
aws:
emr:
credentials.provider.type: InstanceProfileCredentialsProvider
region: us-east-1
```

### 使用 Access Key

如果需要使用 AK/SK 方式认证:

```yaml
aws:
emr:
credentials.provider.type: AWSStaticCredentialsProvider
access.key.id: your-access-key-id
access.key.secret: your-secret-access-key
region: us-east-1
```

> **注意**:`aws.emr` 段的配置同时被 EMR on EC2 和 EMR Serverless 任务类型共享。

## 作业状态流转

EMR Serverless 作业提交后,DolphinScheduler 会每 10 秒轮询一次作业状态:

```
SUBMITTED → PENDING → SCHEDULED → RUNNING → SUCCESS
→ FAILED
→ CANCELLED
```

- 作业进入 `SUCCESS` 状态时,任务标记为成功
- 作业进入 `FAILED` 或 `CANCELLED` 状态时,任务标记为失败
- 如果 DolphinScheduler 任务被终止,会自动调用 [CancelJobRun API](https://docs.aws.amazon.com/emr-serverless/latest/APIReference/API_CancelJobRun.html) 取消正在运行的作业

## 注意事项

- **Application Id** 对应的 EMR Serverless 应用程序需要预先在 AWS 控制台或通过 API 创建,并确保处于 `STARTED` 或 `CREATED` 状态
- **Execution Role** 需要有以下最小权限:`emr-serverless:StartJobRun`、`emr-serverless:GetJobRun`、`emr-serverless:CancelJobRun`,以及作业所需的 S3、Glue 等数据访问权限
- `StartJobRunRequest JSON` 中无需填写 `ApplicationId` 和 `ExecutionRoleArn` 字段,系统会自动从表单参数注入
- EMR Serverless 任务支持故障转移(Failover):当 Worker 节点发生故障时,新的 Worker 可以通过 `appIds`(即 `jobRunId`)恢复对正在运行作业的跟踪
Binary file added docs/img/tasks/demo/emr_serverless_create.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
5 changes: 5 additions & 0 deletions dolphinscheduler-bom/pom.xml
Original file line number Diff line number Diff line change
Expand Up @@ -752,6 +752,11 @@
<artifactId>aws-java-sdk-emr</artifactId>
<version>${aws-sdk.version}</version>
</dependency>
<dependency>
<groupId>com.amazonaws</groupId>
<artifactId>aws-java-sdk-emrserverless</artifactId>
<version>${aws-sdk.version}</version>
</dependency>
<dependency>
<groupId>com.amazonaws</groupId>
<artifactId>aws-java-sdk-s3</artifactId>
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -112,6 +112,12 @@
<version>${project.version}</version>
</dependency>

<dependency>
<groupId>org.apache.dolphinscheduler</groupId>
<artifactId>dolphinscheduler-task-emr-serverless</artifactId>
<version>${project.version}</version>
</dependency>

<dependency>
<groupId>org.apache.dolphinscheduler</groupId>
<artifactId>dolphinscheduler-task-zeppelin</artifactId>
Expand Down
Loading
Loading