Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
17 commits
Select commit Hold shift + click to select a range
622ff63
feat: add Amazon EMR Serverless task plugin
norrishuang Mar 14, 2026
db7f4c7
test: add unit tests for EMR Serverless task plugin
norrishuang Mar 14, 2026
1bd61c7
fix: add shade plugin and TS type defs for EMR Serverless task
norrishuang Mar 14, 2026
f7293cf
fix: add EMR_SERVERLESS to store/project/task-type.ts TASK_TYPES_MAP
norrishuang Mar 14, 2026
c39c062
fix: ignore endpoint config for EMR Serverless client
norrishuang Mar 14, 2026
155fea6
docs: add EMR Serverless task icon and documentation (zh/en)
norrishuang Mar 14, 2026
13a7b82
docs: add EMR Serverless screenshots
norrishuang Mar 14, 2026
8978147
fix: add EMR_SERVERLESS to cloud task type config for UI sidebar display
norrishuang Mar 16, 2026
7c292a1
style: fix spotless format issues in emr-serverless docs
norrishuang Mar 16, 2026
76882a5
style: fix eslint line-length issues in emr-serverless i18n strings
norrishuang Mar 17, 2026
44f43eb
test: enhance EmrServerlessTask unit tests with comprehensive mock co…
norrishuang Mar 21, 2026
b96944c
test: add api-test for EMR Serverless task plugin using WireMock
norrishuang Mar 23, 2026
d9e04f6
ci: retrigger CI for PR #18069
norrishuang Mar 27, 2026
e1d1d76
test: address review comments - add CI registration, failed case, upd…
norrishuang Apr 1, 2026
15e17d2
test: fix WireMock urlPattern conflict, add failed workflow test case…
norrishuang Apr 1, 2026
cf19524
style: fix spotless format violation in EmrServerlessTaskAPITest
norrishuang Apr 1, 2026
0d2d0d3
style: apply spotless format from root pom
norrishuang Apr 1, 2026
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions .github/workflows/api-test.yml
Original file line number Diff line number Diff line change
Expand Up @@ -117,6 +117,8 @@ jobs:
class: org.apache.dolphinscheduler.api.test.cases.GrpcTaskAPITest
- name: OidcLoginAPITest
class: org.apache.dolphinscheduler.api.test.cases.OidcLoginAPITest
- name: EmrServerlessTaskAPITest
class: org.apache.dolphinscheduler.api.test.cases.tasks.EmrServerlessTaskAPITest
env:
RECORDING_PATH: /tmp/recording-${{ matrix.case.name }}
steps:
Expand Down
8 changes: 8 additions & 0 deletions docs/configs/docsdev.js
Original file line number Diff line number Diff line change
Expand Up @@ -153,6 +153,10 @@ export default {
title: 'Amazon EMR',
link: '/en-us/docs/dev/user_doc/guide/task/emr.html',
},
{
title: 'Amazon EMR Serverless',
link: '/en-us/docs/dev/user_doc/guide/task/emr-serverless.html',
},
{
title: 'Apache Zeppelin',
link: '/en-us/docs/dev/user_doc/guide/task/zeppelin.html',
Expand Down Expand Up @@ -881,6 +885,10 @@ export default {
title: 'Amazon EMR',
link: '/zh-cn/docs/dev/user_doc/guide/task/emr.html',
},
{
title: 'Amazon EMR Serverless',
link: '/zh-cn/docs/dev/user_doc/guide/task/emr-serverless.html',
},
{
title: 'Apache Zeppelin',
link: '/zh-cn/docs/dev/user_doc/guide/task/zeppelin.html',
Expand Down
144 changes: 144 additions & 0 deletions docs/docs/en/guide/task/emr-serverless.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,144 @@
# Amazon EMR Serverless

## Overview

Amazon EMR Serverless task type, for submitting and monitoring job runs on [Amazon EMR Serverless](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless.html) applications.
Unlike traditional EMR on EC2, EMR Serverless requires no cluster infrastructure management and automatically scales compute resources on demand, suitable for Spark and Hive workloads.

Using [aws-java-sdk](https://aws.amazon.com/cn/sdk-for-java/) in the background code, to transfer JSON parameters to a [StartJobRunRequest](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/emrserverless/model/StartJobRunRequest.html) object
and submit it to AWS via the [StartJobRun API](https://docs.aws.amazon.com/emr-serverless/latest/APIReference/API_StartJobRun.html), then poll job status via the [GetJobRun API](https://docs.aws.amazon.com/emr-serverless/latest/APIReference/API_GetJobRun.html) until completion.

## Create Task

- Click `Project Management -> Project Name -> Workflow Definition`, click the `Create Workflow` button to enter the DAG editing page.
- Drag `AmazonEMRServerless` task from the toolbar to the artboard to complete the creation.

## Task Parameters

[//]: # (TODO: use the commented anchor below once our website template supports this syntax)
[//]: # (- Please refer to [DolphinScheduler Task Parameters Appendix](appendix.md#default-task-parameters) `Default Task Parameters` section for default parameters.)

- Please refer to [DolphinScheduler Task Parameters Appendix](appendix.md) `Default Task Parameters` section for default parameters.

| **Parameter** | **Description** |
|-------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Application Id | EMR Serverless application ID (e.g. `00fkht2eodujab09`), obtainable from the [EMR Serverless Console](https://console.aws.amazon.com/emr/home#/serverless) |
| Execution Role Arn | ARN of the IAM role for job execution (e.g. `arn:aws:iam::123456789012:role/EMRServerlessRole`), this role needs permissions to access S3, Glue, and other services |
| Job Name | Job name (optional), used to identify the job in the EMR Serverless console |
| StartJobRunRequest JSON | JSON corresponding to the `JobDriver` and `ConfigurationOverrides` portions of the [StartJobRunRequest](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/emrserverless/model/StartJobRunRequest.html), see examples below. **Note**: `ApplicationId` and `ExecutionRoleArn` do not need to be included in the JSON as they are automatically injected from the form parameters above |

![RUN_JOB_FLOW](../../../../img/tasks/demo/emr_serverless_create.png)

## Task Example

### Submit a Spark Job

This example shows how to create an `EMR_SERVERLESS` task node to submit a Spark job to an EMR Serverless application.

StartJobRunRequest JSON example (Spark):

```json
{
"JobDriver": {
"SparkSubmit": {
"EntryPoint": "s3://my-bucket/scripts/my-spark-job.jar",
"EntryPointArguments": [
"s3://my-bucket/input/",
"s3://my-bucket/output/"
],
"SparkSubmitParameters": "--class com.example.MySparkApp --conf spark.executor.cores=4 --conf spark.executor.memory=8g --conf spark.executor.instances=10"
}
},
"ConfigurationOverrides": {
"MonitoringConfiguration": {
"S3MonitoringConfiguration": {
"LogUri": "s3://my-bucket/emr-serverless-logs/"
}
}
}
}
```

### Submit a Hive Job

This example shows how to create an `EMR_SERVERLESS` task node to submit a Hive query job.

StartJobRunRequest JSON example (Hive):

```json
{
"JobDriver": {
"HiveSQL": {
"Query": "s3://my-bucket/scripts/my-hive-query.sql",
"Parameters": "--hiveconf hive.exec.dynamic.partition=true --hiveconf hive.exec.dynamic.partition.mode=nonstrict"
}
},
"ConfigurationOverrides": {
"MonitoringConfiguration": {
"S3MonitoringConfiguration": {
"LogUri": "s3://my-bucket/emr-serverless-logs/"
}
},
"ApplicationConfiguration": [
{
"Classification": "hive-site",
"Properties": {
"hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
}
}
]
}
}
```

## AWS Authentication Configuration

The EMR Serverless task reads AWS credentials from the DolphinScheduler `aws.yaml` configuration file, under the `aws.emr` section at `conf/aws.yaml`.

### Using IAM Role (Recommended)

If the DolphinScheduler Worker node runs on an EC2 instance with an attached IAM Role:

```yaml
aws:
emr:
credentials.provider.type: InstanceProfileCredentialsProvider
region: us-east-1
```

### Using Access Key

If you need to authenticate using AK/SK:

```yaml
aws:
emr:
credentials.provider.type: AWSStaticCredentialsProvider
access.key.id: your-access-key-id
access.key.secret: your-secret-access-key
region: us-east-1
```

> **Note**: The `aws.emr` section configuration is shared by both EMR on EC2 and EMR Serverless task types.

## Job State Transitions

After an EMR Serverless job is submitted, DolphinScheduler polls the job status every 10 seconds:

```
SUBMITTED → PENDING → SCHEDULED → RUNNING → SUCCESS
→ FAILED
→ CANCELLED
```

- When a job reaches `SUCCESS` state, the task is marked as successful
- When a job reaches `FAILED` or `CANCELLED` state, the task is marked as failed
- If a DolphinScheduler task is killed, it automatically calls the [CancelJobRun API](https://docs.aws.amazon.com/emr-serverless/latest/APIReference/API_CancelJobRun.html) to cancel the running job

## Notice

- The **Application Id** must correspond to a pre-existing EMR Serverless application (created via the AWS Console or API) in `STARTED` or `CREATED` state
- The **Execution Role** requires the following minimum permissions: `emr-serverless:StartJobRun`, `emr-serverless:GetJobRun`, `emr-serverless:CancelJobRun`, plus S3, Glue and other data access permissions required by the job
- `StartJobRunRequest JSON` should NOT include `ApplicationId` or `ExecutionRoleArn` fields — they are automatically injected from the form parameters
- EMR Serverless task supports failover: when a Worker node fails, a new Worker can recover tracking of running jobs through `appIds` (the `jobRunId`)

144 changes: 144 additions & 0 deletions docs/docs/zh/guide/task/emr-serverless.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,144 @@
# Amazon EMR Serverless

## 综述

Amazon EMR Serverless 任务类型,用于向 [Amazon EMR Serverless](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless.html) 应用程序提交并监控作业运行。
与传统的 EMR on EC2 不同,EMR Serverless 无需管理集群基础设施,按需自动扩缩容计算资源,适用于 Spark 和 Hive 工作负载。

后台使用 [aws-java-sdk](https://aws.amazon.com/cn/sdk-for-java/) 将 JSON 参数转换为 [StartJobRunRequest](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/emrserverless/model/StartJobRunRequest.html) 对象,
通过 [StartJobRun API](https://docs.aws.amazon.com/emr-serverless/latest/APIReference/API_StartJobRun.html) 提交到 AWS,并通过 [GetJobRun API](https://docs.aws.amazon.com/emr-serverless/latest/APIReference/API_GetJobRun.html) 轮询作业状态直到完成。

## 创建任务

- 点击 `项目管理 -> 项目名称 -> 工作流定义`,点击 `创建工作流` 按钮进入 DAG 编辑页面。
- 从工具栏中拖拽 `AmazonEMRServerless` 任务到画布中完成创建。

## 任务参数

[//]: # (TODO: use the commented anchor below once our website template supports this syntax)
[//]: # (- 默认参数说明请参考[DolphinScheduler任务参数附录](appendix.md#默认任务参数)`默认任务参数`一栏。)

- 默认参数说明请参考[DolphinScheduler任务参数附录](appendix.md)`默认任务参数`一栏。

| **任务参数** | **描述** |
|-------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Application Id | EMR Serverless 应用程序 ID(格式如 `00fkht2eodujab09`),可在 [EMR Serverless 控制台](https://console.aws.amazon.com/emr/home#/serverless) 获取 |
| Execution Role Arn | 作业执行 IAM 角色的 ARN(格式如 `arn:aws:iam::123456789012:role/EMRServerlessRole`),该角色需要有访问 S3、Glue 等服务的权限 |
| Job Name | 作业名称(可选),用于在 EMR Serverless 控制台中标识作业 |
| StartJobRunRequest JSON | [StartJobRunRequest](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/emrserverless/model/StartJobRunRequest.html) 中 `JobDriver` 和 `ConfigurationOverrides` 部分对应的 JSON,详细定义见下方示例。**注意**:`ApplicationId` 和 `ExecutionRoleArn` 无需在 JSON 中重复填写,系统会自动从上方参数注入 |

![RUN_JOB_FLOW](../../../../img/tasks/demo/emr_serverless_create.png)

## 任务样例

### 提交 Spark 作业

该样例展示了如何创建 `EMR_SERVERLESS` 任务节点来提交一个 Spark 作业到 EMR Serverless 应用程序。

StartJobRunRequest JSON 参数样例(Spark):

```json
{
"JobDriver": {
"SparkSubmit": {
"EntryPoint": "s3://my-bucket/scripts/my-spark-job.jar",
"EntryPointArguments": [
"s3://my-bucket/input/",
"s3://my-bucket/output/"
],
"SparkSubmitParameters": "--class com.example.MySparkApp --conf spark.executor.cores=4 --conf spark.executor.memory=8g --conf spark.executor.instances=10"
}
},
"ConfigurationOverrides": {
"MonitoringConfiguration": {
"S3MonitoringConfiguration": {
"LogUri": "s3://my-bucket/emr-serverless-logs/"
}
}
}
}
```

### 提交 Hive 作业

该样例展示了如何创建 `EMR_SERVERLESS` 任务节点来提交一个 Hive 查询作业。

StartJobRunRequest JSON 参数样例(Hive):

```json
{
"JobDriver": {
"HiveSQL": {
"Query": "s3://my-bucket/scripts/my-hive-query.sql",
"Parameters": "--hiveconf hive.exec.dynamic.partition=true --hiveconf hive.exec.dynamic.partition.mode=nonstrict"
}
},
"ConfigurationOverrides": {
"MonitoringConfiguration": {
"S3MonitoringConfiguration": {
"LogUri": "s3://my-bucket/emr-serverless-logs/"
}
},
"ApplicationConfiguration": [
{
"Classification": "hive-site",
"Properties": {
"hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
}
}
]
}
}
```

## AWS 认证配置

EMR Serverless 任务通过 DolphinScheduler 的 `aws.yaml` 配置文件读取 AWS 认证信息,配置路径为 `conf/aws.yaml` 中的 `aws.emr` 段。

### 使用 IAM Role(推荐)

如果 DolphinScheduler Worker 节点运行在 EC2 实例上并已绑定 IAM Role,配置如下:

```yaml
aws:
emr:
credentials.provider.type: InstanceProfileCredentialsProvider
region: us-east-1
```

### 使用 Access Key

如果需要使用 AK/SK 方式认证:

```yaml
aws:
emr:
credentials.provider.type: AWSStaticCredentialsProvider
access.key.id: your-access-key-id
access.key.secret: your-secret-access-key
region: us-east-1
```

> **注意**:`aws.emr` 段的配置同时被 EMR on EC2 和 EMR Serverless 任务类型共享。

## 作业状态流转

EMR Serverless 作业提交后,DolphinScheduler 会每 10 秒轮询一次作业状态:

```
SUBMITTED → PENDING → SCHEDULED → RUNNING → SUCCESS
→ FAILED
→ CANCELLED
```

- 作业进入 `SUCCESS` 状态时,任务标记为成功
- 作业进入 `FAILED` 或 `CANCELLED` 状态时,任务标记为失败
- 如果 DolphinScheduler 任务被终止,会自动调用 [CancelJobRun API](https://docs.aws.amazon.com/emr-serverless/latest/APIReference/API_CancelJobRun.html) 取消正在运行的作业

## 注意事项

- **Application Id** 对应的 EMR Serverless 应用程序需要预先在 AWS 控制台或通过 API 创建,并确保处于 `STARTED` 或 `CREATED` 状态
- **Execution Role** 需要有以下最小权限:`emr-serverless:StartJobRun`、`emr-serverless:GetJobRun`、`emr-serverless:CancelJobRun`,以及作业所需的 S3、Glue 等数据访问权限
- `StartJobRunRequest JSON` 中无需填写 `ApplicationId` 和 `ExecutionRoleArn` 字段,系统会自动从表单参数注入
- EMR Serverless 任务支持故障转移(Failover):当 Worker 节点发生故障时,新的 Worker 可以通过 `appIds`(即 `jobRunId`)恢复对正在运行作业的跟踪

Binary file added docs/img/tasks/demo/emr_serverless_create.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Loading