[HWORKS-2190] Describe the job configuration and how to access the filesystem in the docs (#471) #474

Merged 1 commit on Jun 5, 2025
32 changes: 29 additions & 3 deletions docs/user_guides/projects/jobs/notebook_job.md
@@ -82,7 +82,7 @@ It is possible to also set following configuration settings for a `PYTHON` job.
* `Environment`: The python environment to use
* `Container memory`: The amount of memory in MB to be allocated to the Jupyter Notebook script
* `Container cores`: The number of cores to be allocated for the Jupyter Notebook script
* `Additional files`: List of files that will be locally accessible by the application
* `Additional files`: List of files that will be locally accessible in the working directory of the application. Recommended only if the project datasets are not mounted under `/hopsfs`.
You can always modify the arguments in the job settings.

<p align="center">
@@ -142,7 +142,7 @@ In this snippet we get the `JobsApi` object to get the default job configuration

```python

jobs_api = project.get_jobs_api()
jobs_api = project.get_job_api()

notebook_job_config = jobs_api.get_configuration("PYTHON")

@@ -166,7 +166,33 @@ In this code snippet, we execute the job with arguments and wait until it reaches
execution = job.run(args='-p a 2 -p b 5', await_termination=True)
```

### API Reference
## Configuration
The following table describes the JSON payload returned by `jobs_api.get_configuration("PYTHON")`:

| Field | Type | Description | Default |
|-------------------------|----------------|------------------------------------------------------|--------------------------|
| `type` | string | Type of the job configuration | `"pythonJobConfiguration"` |
| `appPath` | string | Project path to notebook (e.g. `Resources/foo.ipynb`) | `null` |
| `environmentName` | string | Name of the python environment | `"pandas-training-pipeline"` |
| `resourceConfig.cores` | number (float) | Number of CPU cores to be allocated | `1.0` |
| `resourceConfig.memory` | number (int) | Number of MBs to be allocated | `2048` |
| `resourceConfig.gpus` | number (int) | Number of GPUs to be allocated | `0` |
| `logRedirection` | boolean | Whether logs are redirected | `true` |
| `jobType` | string | Type of job | `"PYTHON"` |

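As an illustration, the returned payload can be tweaked before creating the job. The snippet below is only a minimal sketch: the notebook path and job name are placeholders, and it assumes the resource settings are nested under a `resourceConfig` key (as the dotted field names above suggest) and that the job is created with `jobs_api.create_job`.

```python
notebook_job_config = jobs_api.get_configuration("PYTHON")

notebook_job_config["appPath"] = "Resources/my_notebook.ipynb"  # placeholder notebook path
notebook_job_config["resourceConfig"]["memory"] = 4096          # 4 GB instead of the 2048 MB default
notebook_job_config["resourceConfig"]["cores"] = 2.0            # two cores instead of one

job = jobs_api.create_job("my_notebook_job", notebook_job_config)
```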

## Accessing project data
!!! note "Recommended approach if `/hopsfs` is mounted"
    If your Hopsworks installation is configured to mount the project datasets under `/hopsfs`, which is the case in most installations, refer to this section to reference file resources instead of using the `Additional files` property.

### Absolute paths
The project datasets are mounted under `/hopsfs`, so you can access `data.csv` from the `Resources` dataset using `/hopsfs/Resources/data.csv` in your notebook.

### Relative paths
The notebook's working directory is the folder it is located in. For example, if the notebook is located in the `Resources` dataset and that dataset contains a file named `data.csv`, you can access it simply as `data.csv`. Likewise, if you write a local file, for example `output.txt`, it will be saved in the `Resources` dataset.

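For example, a notebook stored in the `Resources` dataset could read the file either way; the pandas calls and file names below are only an illustrative sketch.

```python
import pandas as pd

# Absolute path through the /hopsfs mount
df = pd.read_csv("/hopsfs/Resources/data.csv")

# Relative path, resolved against the notebook's working directory (here: Resources)
df = pd.read_csv("data.csv")

# Writing a local file persists it in the same dataset
df.head().to_csv("output.txt", index=False)
```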

## API Reference

[Jobs](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/jobs/)

44 changes: 41 additions & 3 deletions docs/user_guides/projects/jobs/pyspark_job.md
@@ -8,7 +8,7 @@ description: Documentation on how to configure and execute a PySpark job on Hopsworks

All members of a project in Hopsworks can launch the following types of applications through a project's Jobs service:

- Python (*Hopsworks Enterprise only*)
- Python
- Apache Spark

Launching a job of any type is a very similar process; what mostly differs between job types is
@@ -179,7 +179,7 @@ In this snippet we get the `JobsApi` object to get the default job configuration

```python

jobs_api = project.get_jobs_api()
jobs_api = project.get_job_api()

spark_config = jobs_api.get_configuration("PYSPARK")

@@ -211,7 +211,45 @@ print(f_err.read())

```

### API Reference
## Configuration
The following table describes the JSON payload returned by `jobs_api.get_configuration("PYSPARK")`:

| Field | Type | Description | Default |
| ------------------------------------------ | -------------- |-----------------------------------------------------| -------------------------- |
| `type` | string | Type of the job configuration | `"sparkJobConfiguration"` |
| `appPath` | string | Project path to script (e.g. `Resources/foo.py`) | `null` |
| `environmentName` | string | Name of the project spark environment | `"spark-feature-pipeline"` |
| `spark.driver.cores` | number (float) | Number of CPU cores allocated for the driver | `1.0` |
| `spark.driver.memory` | number (int) | Memory allocated for the driver (in MB) | `2048` |
| `spark.executor.instances` | number (int) | Number of executor instances | `1` |
| `spark.executor.cores` | number (float) | Number of CPU cores per executor | `1.0` |
| `spark.executor.memory` | number (int) | Memory allocated per executor (in MB) | `4096` |
| `spark.dynamicAllocation.enabled` | boolean | Enable dynamic allocation of executors | `true` |
| `spark.dynamicAllocation.minExecutors` | number (int) | Minimum number of executors with dynamic allocation | `1` |
| `spark.dynamicAllocation.maxExecutors` | number (int) | Maximum number of executors with dynamic allocation | `2` |
| `spark.dynamicAllocation.initialExecutors` | number (int) | Initial number of executors with dynamic allocation | `1` |
| `spark.blacklist.enabled` | boolean | Whether executor/node blacklisting is enabled | `false` |

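As a sketch of how these fields could be tuned through the Python API before creating the job (the script path and job name below are placeholders, not part of the default payload, and `jobs_api.create_job` is assumed):

```python
spark_config = jobs_api.get_configuration("PYSPARK")

spark_config["appPath"] = "Resources/feature_pipeline.py"  # placeholder script path
spark_config["spark.executor.memory"] = 8192               # 8 GB executors instead of the 4096 MB default
spark_config["spark.dynamicAllocation.maxExecutors"] = 5   # allow scaling out to five executors

job = jobs_api.create_job("feature_pipeline", spark_config)
execution = job.run(await_termination=True)
```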

## Accessing project data

### Read directly from the filesystem (recommended)

To read a dataset in your project using Spark, use the full filesystem path where the data is stored. For example, to read a CSV file named `data.csv` located in the `Resources` dataset of a project called `my_project`:

```python
df = spark.read.csv("/Projects/my_project/Resources/data.csv", header=True, inferSchema=True)
df.show()
```

### Additional files

Different file types can be attached to the Spark job and made available in the `/srv/hops/artifacts` folder when the PySpark job is started. This configuration is mainly useful when you need additional setup, such as jars that need to be added to the CLASSPATH.

When reading data in your Spark job, it is recommended to use the Spark read API as previously demonstrated, since it reads directly from the filesystem, whereas the `Additional files` option downloads the files in their entirety, which does not scale.

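As an illustration, a file attached this way could be read from the artifacts folder inside the job; the file name below is a placeholder.

```python
import json

# Files listed under `Additional files` are placed here when the job starts
with open("/srv/hops/artifacts/settings.json") as f:  # placeholder file name
    settings = json.load(f)
```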

## API Reference

[Jobs](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/jobs/)

33 changes: 30 additions & 3 deletions docs/user_guides/projects/jobs/python_job.md
@@ -81,7 +81,8 @@ It is possible to also set following configuration settings for a `PYTHON` job.
* `Environment`: The python environment to use
* `Container memory`: The amount of memory in MB to be allocated to the Python script
* `Container cores`: The number of cores to be allocated for the Python script
* `Additional files`: List of files that will be locally accessible by the application
* `Additional files`: List of files that will be locally accessible in the working directory of the application. Recommended only if the project datasets are not mounted under `/hopsfs`.
You can always modify the arguments in the job settings.

<p align="center">
<figure>
@@ -129,7 +130,7 @@ In this snippet we get the `JobsApi` object to get the default job configuration

```python

jobs_api = project.get_jobs_api()
jobs_api = project.get_job_api()

py_job_config = jobs_api.get_configuration("PYTHON")

@@ -163,7 +164,33 @@ print(f_err.read())

```

### API Reference
## Configuration
The following table describes the JSON payload returned by `jobs_api.get_configuration("PYTHON")`:

| Field | Type | Description | Default |
|-------------------------|----------------|-------------------------------------------------|--------------------------|
| `type` | string | Type of the job configuration | `"pythonJobConfiguration"` |
| `appPath` | string | Project path to script (e.g. `Resources/foo.py`) | `null` |
| `environmentName` | string | Name of the project python environment | `"pandas-training-pipeline"` |
| `resourceConfig.cores` | number (float) | Number of CPU cores to be allocated | `1.0` |
| `resourceConfig.memory` | number (int) | Number of MBs to be allocated | `2048` |
| `resourceConfig.gpus` | number (int) | Number of GPUs to be allocated | `0` |
| `logRedirection` | boolean | Whether logs are redirected | `true` |
| `jobType` | string | Type of job | `"PYTHON"` |

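For example, a sketch of overriding the defaults to request a GPU (the script path and job name are placeholders, the nested `resourceConfig` structure is assumed from the dotted field names above, and `jobs_api.create_job` is assumed):

```python
py_job_config = jobs_api.get_configuration("PYTHON")

py_job_config["appPath"] = "Resources/train.py"   # placeholder script path
py_job_config["resourceConfig"]["gpus"] = 1       # one GPU instead of the default 0
py_job_config["resourceConfig"]["memory"] = 8192  # 8 GB of memory

job = jobs_api.create_job("train_model", py_job_config)
```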

## Accessing project data
!!! note "Recommended approach if `/hopsfs` is mounted"
    If your Hopsworks installation is configured to mount the project datasets under `/hopsfs`, which is the case in most installations, refer to this section to reference file resources instead of using the `Additional files` property.

### Absolute paths
The project datasets are mounted under `/hopsfs`, so you can access `data.csv` from the `Resources` dataset using `/hopsfs/Resources/data.csv` in your script.

### Relative paths
The script's working directory is the folder it is located in. For example, if the script is located in the `Resources` dataset and that dataset contains a file named `data.csv`, you can access it simply as `data.csv`. Likewise, if you write a local file, for example `output.txt`, it will be saved in the `Resources` dataset.

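For illustration, a script stored in the `Resources` dataset could combine both styles; the file names below are placeholders and only the standard library is used.

```python
import csv

# Absolute path through the /hopsfs mount
with open("/hopsfs/Resources/data.csv") as f:
    rows = list(csv.reader(f))

# Relative write: the file ends up in the Resources dataset next to the script
with open("output.txt", "w") as f:
    f.write(f"read {len(rows)} rows\n")
```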

## API Reference

[Jobs](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/jobs/)

11 changes: 8 additions & 3 deletions docs/user_guides/projects/jobs/ray_job.md
@@ -8,7 +8,7 @@ description: Documentation on how to configure and execute a Ray job on Hopsworks

All members of a project in Hopsworks can launch the following types of applications through a project's Jobs service:

- Python (*Hopsworks Enterprise only*)
- Python
- Apache Spark
- Ray

@@ -168,7 +168,7 @@ In this snippet we get the `JobsApi` object to get the default job configuration

```python

jobs_api = project.get_jobs_api()
jobs_api = project.get_job_api()

ray_config = jobs_api.get_configuration("RAY")

@@ -203,7 +203,12 @@ print(f_err.read())

```

### API Reference
## Accessing project data

The project datasets are mounted under `/home/yarnapp/hopsfs`, so you can access `data.csv` from the `Resources` dataset using `/home/yarnapp/hopsfs/Resources/data.csv` in your script.

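For example, inside the Ray driver or a task the mounted path can be read like any local file; the file name below is a placeholder.

```python
# The /home/yarnapp/hopsfs mount behaves like a regular directory tree
with open("/home/yarnapp/hopsfs/Resources/data.csv") as f:
    header = f.readline().strip()
    print(header)
```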

## API Reference

[Jobs](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/jobs/)

45 changes: 43 additions & 2 deletions docs/user_guides/projects/jobs/spark_job.md
@@ -183,7 +183,7 @@ In this snippet we get the `JobsApi` object to get the default job configuration

```python

jobs_api = project.get_jobs_api()
jobs_api = project.get_job_api()

spark_config = jobs_api.get_configuration("SPARK")

@@ -212,7 +212,48 @@ print(f_err.read())

```

### API Reference
## Configuration
The following table describes the JSON payload returned by `jobs_api.get_configuration("SPARK")`:

| Field | Type | Description | Default |
|--------------------------------------------| -------------- |---------------------------------------------------------| -------------------------- |
| `type` | string | Type of the job configuration | `"sparkJobConfiguration"` |
| `appPath` | string | Project path to Spark program (e.g. `Resources/foo.jar`) | `null` |
| `mainClass` | string | Name of the main class to run (e.g. `org.company.Main`) | `null` |
| `environmentName` | string | Name of the project spark environment | `"spark-feature-pipeline"` |
| `spark.driver.cores` | number (float) | Number of CPU cores allocated for the driver | `1.0` |
| `spark.driver.memory` | number (int) | Memory allocated for the driver (in MB) | `2048` |
| `spark.executor.instances` | number (int) | Number of executor instances | `1` |
| `spark.executor.cores` | number (float) | Number of CPU cores per executor | `1.0` |
| `spark.executor.memory` | number (int) | Memory allocated per executor (in MB) | `4096` |
| `spark.dynamicAllocation.enabled` | boolean | Enable dynamic allocation of executors | `true` |
| `spark.dynamicAllocation.minExecutors` | number (int) | Minimum number of executors with dynamic allocation | `1` |
| `spark.dynamicAllocation.maxExecutors` | number (int) | Maximum number of executors with dynamic allocation | `2` |
| `spark.dynamicAllocation.initialExecutors` | number (int) | Initial number of executors with dynamic allocation | `1` |
| `spark.blacklist.enabled` | boolean | Whether executor/node blacklisting is enabled | `false` |

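As a sketch of configuring a jar-based Spark job through the Python API (the jar path, main class, and job name below are placeholders, and `jobs_api.create_job` is assumed):

```python
spark_config = jobs_api.get_configuration("SPARK")

spark_config["appPath"] = "Resources/pipeline.jar"  # placeholder jar path
spark_config["mainClass"] = "org.company.Main"      # entry point inside the jar
spark_config["spark.executor.instances"] = 2        # two executors instead of one

job = jobs_api.create_job("spark_pipeline", spark_config)
```
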
## Accessing project data

### Read directly from the filesystem (recommended)

To read a dataset in your project using Spark, use the full filesystem path where the data is stored. For example, to read a CSV file named `data.csv` located in the `Resources` dataset of a project called `my_project`:

```java
Dataset<Row> df = spark.read()
.option("header", "true") // CSV has header
.option("inferSchema", "true") // Infer data types
.csv("/Projects/my_project/Resources/data.csv");

df.show();
```

### Additional files

Different file types can be attached to the Spark job and made available in the `/srv/hops/artifacts` folder when the Spark job is started. This configuration is mainly useful when you need additional configuration, such as jars that need to be added to the CLASSPATH.

When reading data in your Spark job, it is recommended to use the Spark read API as previously demonstrated, since it reads directly from the filesystem, whereas the `Additional files` option downloads the files in their entirety, which does not scale.

## API Reference

[Jobs](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/jobs/)

13 changes: 12 additions & 1 deletion docs/user_guides/projects/jupyter/python_notebook.md
@@ -5,7 +5,7 @@
Jupyter is provided as a service in Hopsworks, offering the same user experience and features as if it were run on your laptop.

* Supports JupyterLab and the classic Jupyter front-end
* Configured with Python and PySpark kernels
* Configured with Python3, PySpark and Ray kernels

## Step 1: Jupyter dashboard

@@ -82,6 +82,17 @@ Start the Jupyter instance by clicking the `Run Jupyter` button.
</figure>
</p>

## Accessing project data
!!! note "Recommended approach if `/hopsfs` is mounted"
    If your Hopsworks installation is configured to mount the project datasets under `/hopsfs`, which is the case in most installations, refer to this section.
    If the file system is not mounted, project files can instead be downloaded into the current working directory using the [download API](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/datasets/#download), as sketched below.

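A minimal sketch of localizing a file this way, assuming the dataset API's `download` method and using a placeholder file path:

```python
import hopsworks

project = hopsworks.login()
dataset_api = project.get_dataset_api()

# Download Resources/data.csv into the current working directory
local_path = dataset_api.download("Resources/data.csv", overwrite=True)
print(local_path)
```
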
### Absolute paths
The project datasets are mounted under `/hopsfs`, so you can access `data.csv` from the `Resources` dataset using `/hopsfs/Resources/data.csv` in your notebook.

### Relative paths
The notebook's working directory is the folder it is located in. For example, if the notebook is located in the `Resources` dataset and that dataset contains a file named `data.csv`, you can access it simply as `data.csv`. Likewise, if you write a local file, for example `output.txt`, it will be saved in the `Resources` dataset.


## Going Further

6 changes: 5 additions & 1 deletion docs/user_guides/projects/jupyter/ray_notebook.md
@@ -139,4 +139,8 @@ In the Ray Dashboard, you can monitor the resources used by code you are running
<img src="../../../../assets/images/guides/jupyter/ray_jupyter_notebook_session.png" alt="Access Ray Dashboard">
<figcaption>Access Ray Dashboard for Jupyter Ray session</figcaption>
</figure>
</p>
</p>

## Accessing project data

The project datasets are mounted under `/home/yarnapp/hopsfs` in the Ray containers, so you can access `data.csv` from the `Resources` dataset using `/home/yarnapp/hopsfs/Resources/data.csv`.
17 changes: 17 additions & 0 deletions docs/user_guides/projects/jupyter/spark_notebook.md
@@ -135,6 +135,23 @@ Navigate back to Hopsworks and a Spark session will have appeared, click on the
</figure>
</p>

## Accessing project data

### Read directly from the filesystem (recommended)

To read a dataset in your project using Spark, use the full filesystem path where the data is stored. For example, to read a CSV file named `data.csv` located in the `Resources` dataset of a project called `my_project`:

```python
df = spark.read.csv("/Projects/my_project/Resources/data.csv", header=True, inferSchema=True)
df.show()
```

### Additional files

Different files can be attached to the Jupyter session and made available in the `/srv/hops/artifacts` folder when the PySpark kernel is started. This configuration is mainly useful when you need additional configuration, such as jars that need to be added to the CLASSPATH.

When reading data in your Spark application, it is recommended to use the Spark read API as previously demonstrated, since it reads directly from the filesystem, whereas the `Additional files` option downloads the files in their entirety, which does not scale.

## Going Further

You can learn how to [install a library](../python/python_install.md) so that it can be used in a notebook.
4 changes: 2 additions & 2 deletions mkdocs.yml
@@ -152,11 +152,11 @@ nav:
- Run Ray Notebook: user_guides/projects/jupyter/ray_notebook.md
- Remote Filesystem Driver: user_guides/projects/jupyter/remote_filesystem_driver.md
- Jobs:
- Run Python Job: user_guides/projects/jobs/python_job.md
- Run Jupyter Notebook Job: user_guides/projects/jobs/notebook_job.md
- Run PySpark Job: user_guides/projects/jobs/pyspark_job.md
- Run Spark Job: user_guides/projects/jobs/spark_job.md
- Run Python Job: user_guides/projects/jobs/python_job.md
- Run Ray Job: user_guides/projects/jobs/ray_job.md
- Run Jupyter Notebook Job: user_guides/projects/jobs/notebook_job.md
- Scheduling: user_guides/projects/jobs/schedule_job.md
- Kubernetes Scheduling: user_guides/projects/scheduling/kube_scheduler.md
- Airflow: user_guides/projects/airflow/airflow.md