
Commit 410a3a7

[HWORKS-2190] Describe the job configuration and how to access the filesystem in the docs (#471) (#475)
1 parent a1c84dc commit 410a3a7

9 files changed: +187 −18 lines changed


docs/user_guides/projects/jobs/notebook_job.md

Lines changed: 29 additions & 3 deletions
@@ -82,7 +82,7 @@ It is possible to also set following configuration settings for a `PYTHON` job.
* `Environment`: The python environment to use
* `Container memory`: The amount of memory in MB to be allocated to the Jupyter Notebook script
* `Container cores`: The number of cores to be allocated for the Jupyter Notebook script
-* `Additional files`: List of files that will be locally accessible by the application
+* `Additional files`: List of files that will be locally accessible in the working directory of the application. Only recommended to use if project datasets are not mounted under `/hopsfs`.
You can always modify the arguments in the job settings.

<p align="center">
@@ -142,7 +142,7 @@ In this snippet we get the `JobsApi` object to get the default job configuration

```python

-jobs_api = project.get_jobs_api()
+jobs_api = project.get_job_api()

notebook_job_config = jobs_api.get_configuration("PYTHON")

@@ -166,7 +166,33 @@ In this code snippet, we execute the job with arguments and wait until it reache
execution = job.run(args='-p a 2 -p b 5', await_termination=True)
```

-### API Reference
+## Configuration
+The following table describes the JSON payload returned by `jobs_api.get_configuration("PYTHON")`
+
+| Field                   | Type           | Description                                           | Default                      |
+|-------------------------|----------------|-------------------------------------------------------|------------------------------|
+| `type`                  | string         | Type of the job configuration                         | `"pythonJobConfiguration"`   |
+| `appPath`               | string         | Project path to notebook (e.g `Resources/foo.ipynb`)  | `null`                       |
+| `environmentName`       | string         | Name of the python environment                        | `"pandas-training-pipeline"` |
+| `resourceConfig.cores`  | number (float) | Number of CPU cores to be allocated                   | `1.0`                        |
+| `resourceConfig.memory` | number (int)   | Number of MBs to be allocated                         | `2048`                       |
+| `resourceConfig.gpus`   | number (int)   | Number of GPUs to be allocated                        | `0`                          |
+| `logRedirection`        | boolean        | Whether logs are redirected                           | `true`                       |
+| `jobType`               | string         | Type of job                                           | `"PYTHON"`                   |
+
+
+## Accessing project data
+!!! notice "Recommended approach if `/hopsfs` is mounted"
+    If your Hopsworks installation is configured to mount the project datasets under `/hopsfs`, which it is in most cases, then please refer to this section instead of the `Additional files` property to reference file resources.
+
+### Absolute paths
+The project datasets are mounted under `/hopsfs`, so you can access `data.csv` from the `Resources` dataset using `/hopsfs/Resources/data.csv` in your notebook.
+
+### Relative paths
+The notebook's working directory is the folder it is located in. For example, if it is located in the `Resources` dataset, and you have a file named `data.csv` in that dataset, you simply access it using `data.csv`. Also, if you write a local file, for example `output.txt`, it will be saved in the `Resources` dataset.
+
+
+## API Reference

[Jobs](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/jobs/)
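
Putting the updated snippets together, the end-to-end flow for a notebook job would look roughly like the sketch below. The notebook path, job name, and the treatment of the returned configuration as a nested dictionary (following the field names in the table above) are illustrative assumptions, not values from the commit.

```python
import hopsworks

# Log in to the Hopsworks cluster and get the job API, as in the updated snippet above
project = hopsworks.login()
jobs_api = project.get_job_api()

# Default PYTHON configuration, with the fields listed in the table above
notebook_job_config = jobs_api.get_configuration("PYTHON")
notebook_job_config["appPath"] = "Resources/notebook.ipynb"   # illustrative notebook path
notebook_job_config["resourceConfig"]["memory"] = 2048        # container memory in MB
notebook_job_config["resourceConfig"]["cores"] = 1.0          # container cores

# Create the job and run it with notebook parameters, waiting for a terminal state
job = jobs_api.create_job("notebook_job_example", notebook_job_config)
execution = job.run(args='-p a 2 -p b 5', await_termination=True)
```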

docs/user_guides/projects/jobs/pyspark_job.md

Lines changed: 41 additions & 3 deletions
@@ -8,7 +8,7 @@ description: Documentation on how to configure and execute a PySpark job on Hops

All members of a project in Hopsworks can launch the following types of applications through a project's Jobs service:

-- Python (*Hopsworks Enterprise only*)
+- Python
- Apache Spark

Launching a job of any type is very similar process, what mostly differs between job types is
@@ -179,7 +179,7 @@ In this snippet we get the `JobsApi` object to get the default job configuration

```python

-jobs_api = project.get_jobs_api()
+jobs_api = project.get_job_api()

spark_config = jobs_api.get_configuration("PYSPARK")

@@ -211,7 +211,45 @@ print(f_err.read())

```

-### API Reference
+## Configuration
+The following table describes the JSON payload returned by `jobs_api.get_configuration("PYSPARK")`
+
+| Field                                       | Type           | Description                                          | Default                    |
+|---------------------------------------------|----------------|------------------------------------------------------|----------------------------|
+| `type`                                      | string         | Type of the job configuration                        | `"sparkJobConfiguration"`  |
+| `appPath`                                   | string         | Project path to script (e.g `Resources/foo.py`)      | `null`                     |
+| `environmentName`                           | string         | Name of the project spark environment                | `"spark-feature-pipeline"` |
+| `spark.driver.cores`                        | number (float) | Number of CPU cores allocated for the driver         | `1.0`                      |
+| `spark.driver.memory`                       | number (int)   | Memory allocated for the driver (in MB)              | `2048`                     |
+| `spark.executor.instances`                  | number (int)   | Number of executor instances                         | `1`                        |
+| `spark.executor.cores`                      | number (float) | Number of CPU cores per executor                     | `1.0`                      |
+| `spark.executor.memory`                     | number (int)   | Memory allocated per executor (in MB)                | `4096`                     |
+| `spark.dynamicAllocation.enabled`           | boolean        | Enable dynamic allocation of executors               | `true`                     |
+| `spark.dynamicAllocation.minExecutors`      | number (int)   | Minimum number of executors with dynamic allocation  | `1`                        |
+| `spark.dynamicAllocation.maxExecutors`      | number (int)   | Maximum number of executors with dynamic allocation  | `2`                        |
+| `spark.dynamicAllocation.initialExecutors`  | number (int)   | Initial number of executors with dynamic allocation  | `1`                        |
+| `spark.blacklist.enabled`                   | boolean        | Whether executor/node blacklisting is enabled        | `false`                    |
+
+
+## Accessing project data
+
+### Read directly from the filesystem (recommended)
+
+To read a dataset in your project using Spark, use the full filesystem path where the data is stored. For example, to read a CSV file named `data.csv` located in the `Resources` dataset of a project called `my_project`:
+
+```python
+df = spark.read.csv("/Projects/my_project/Resources/data.csv", header=True, inferSchema=True)
+df.show()
+```
+
+### Additional files
+
+Different file types can be attached to the Spark job and made available in the `/srv/hops/artifacts` folder when the PySpark job is started. This configuration is mainly useful when you need to add additional setup, such as jars that need to be added to the CLASSPATH.
+
+When reading data in your Spark job it is recommended to use the Spark read API as previously demonstrated, since this reads from the filesystem directly, whereas the `Additional files` configuration option downloads the files in their entirety, which is not a scalable option.
+
+
+## API Reference

[Jobs](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/jobs/)
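
As a rough illustration of how the documented fields can be combined from the Python client, the sketch below adjusts a few of the defaults before creating a job. The script path and job name are illustrative, and it assumes the dotted Spark keys appear as flat dictionary keys in the payload, as the table suggests.

```python
import hopsworks

project = hopsworks.login()
jobs_api = project.get_job_api()

# Default PYSPARK configuration, with the fields from the table above
spark_config = jobs_api.get_configuration("PYSPARK")

# Adjust a few documented fields (values are only an example)
spark_config["appPath"] = "Resources/feature_pipeline.py"       # illustrative script path
spark_config["spark.executor.memory"] = 4096                    # MB per executor
spark_config["spark.dynamicAllocation.maxExecutors"] = 4

job = jobs_api.create_job("pyspark_job_example", spark_config)  # illustrative job name
execution = job.run(await_termination=True)
```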

docs/user_guides/projects/jobs/python_job.md

Lines changed: 30 additions & 3 deletions
@@ -81,7 +81,8 @@ It is possible to also set following configuration settings for a `PYTHON` job.
* `Environment`: The python environment to use
* `Container memory`: The amount of memory in MB to be allocated to the Python script
* `Container cores`: The number of cores to be allocated for the Python script
-* `Additional files`: List of files that will be locally accessible by the application
+* `Additional files`: List of files that will be locally accessible in the working directory of the application. Only recommended to use if project datasets are not mounted under `/hopsfs`.
+You can always modify the arguments in the job settings.

<p align="center">
<figure>
@@ -129,7 +130,7 @@ In this snippet we get the `JobsApi` object to get the default job configuration

```python

-jobs_api = project.get_jobs_api()
+jobs_api = project.get_job_api()

py_job_config = jobs_api.get_configuration("PYTHON")

@@ -163,7 +164,33 @@ print(f_err.read())

```

-### API Reference
+## Configuration
+The following table describes the JSON payload returned by `jobs_api.get_configuration("PYTHON")`
+
+| Field                   | Type           | Description                                      | Default                      |
+|-------------------------|----------------|--------------------------------------------------|------------------------------|
+| `type`                  | string         | Type of the job configuration                    | `"pythonJobConfiguration"`   |
+| `appPath`               | string         | Project path to script (e.g `Resources/foo.py`)  | `null`                       |
+| `environmentName`       | string         | Name of the project python environment           | `"pandas-training-pipeline"` |
+| `resourceConfig.cores`  | number (float) | Number of CPU cores to be allocated              | `1.0`                        |
+| `resourceConfig.memory` | number (int)   | Number of MBs to be allocated                    | `2048`                       |
+| `resourceConfig.gpus`   | number (int)   | Number of GPUs to be allocated                   | `0`                          |
+| `logRedirection`        | boolean        | Whether logs are redirected                      | `true`                       |
+| `jobType`               | string         | Type of job                                      | `"PYTHON"`                   |
+
+
+## Accessing project data
+!!! notice "Recommended approach if `/hopsfs` is mounted"
+    If your Hopsworks installation is configured to mount the project datasets under `/hopsfs`, which it is in most cases, then please refer to this section instead of the `Additional files` property to reference file resources.
+
+### Absolute paths
+The project datasets are mounted under `/hopsfs`, so you can access `data.csv` from the `Resources` dataset using `/hopsfs/Resources/data.csv` in your script.
+
+### Relative paths
+The script's working directory is the folder it is located in. For example, if it is located in the `Resources` dataset, and you have a file named `data.csv` in that dataset, you simply access it using `data.csv`. Also, if you write a local file, for example `output.txt`, it will be saved in the `Resources` dataset.
+
+
+## API Reference

[Jobs](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/jobs/)
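
Inside the Python job itself, the absolute and relative path rules described above translate into ordinary file access. A minimal sketch, assuming the script lives in the `Resources` dataset and that `pandas` is available in the selected environment (file names are illustrative):

```python
# Contents of a script such as Resources/process_data.py (illustrative name)
import pandas as pd

# Absolute path: project datasets are mounted under /hopsfs
df = pd.read_csv("/hopsfs/Resources/data.csv")

# Relative path: the working directory is the folder the script is located in,
# so a file in the same dataset can be read directly
df_local = pd.read_csv("data.csv")

# Writing a local file saves it back to the same dataset (Resources in this example)
df.describe().to_csv("output.txt")
```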

docs/user_guides/projects/jobs/ray_job.md

Lines changed: 8 additions & 3 deletions
@@ -8,7 +8,7 @@ description: Documentation on how to configure and execute a Ray job on Hopswork

All members of a project in Hopsworks can launch the following types of applications through a project's Jobs service:

-- Python (*Hopsworks Enterprise only*)
+- Python
- Apache Spark
- Ray

@@ -168,7 +168,7 @@ In this snippet we get the `JobsApi` object to get the default job configuration

```python

-jobs_api = project.get_jobs_api()
+jobs_api = project.get_job_api()

ray_config = jobs_api.get_configuration("RAY")

@@ -203,7 +203,12 @@ print(f_err.read())

```

-### API Reference
+## Accessing project data
+
+The project datasets are mounted under `/home/yarnapp/hopsfs`, so you can access `data.csv` from the `Resources` dataset using `/home/yarnapp/hopsfs/Resources/data.csv` in your script.
+
+
+## API Reference

[Jobs](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/jobs/)
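
Inside the Ray job script, the mount point above is used like any local path. The sketch below is illustrative only: it assumes `pandas` is available in the job's environment and that `ray.init()` with no arguments attaches to the Ray runtime started for the job, which may differ per installation.

```python
import ray
import pandas as pd

# Attach to the Ray runtime for this job (assumption, see note above)
ray.init()

# Project datasets are mounted under /home/yarnapp/hopsfs in the Ray containers
df = pd.read_csv("/home/yarnapp/hopsfs/Resources/data.csv")

@ray.remote
def count_rows(frame):
    return len(frame)

print(ray.get(count_rows.remote(df)))
```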

docs/user_guides/projects/jobs/spark_job.md

Lines changed: 43 additions & 2 deletions
@@ -183,7 +183,7 @@ In this snippet we get the `JobsApi` object to get the default job configuration

```python

-jobs_api = project.get_jobs_api()
+jobs_api = project.get_job_api()

spark_config = jobs_api.get_configuration("SPARK")

@@ -212,7 +212,48 @@ print(f_err.read())

```

-### API Reference
+## Configuration
+The following table describes the JSON payload returned by `jobs_api.get_configuration("SPARK")`
+
+| Field                                       | Type           | Description                                              | Default                    |
+|---------------------------------------------|----------------|----------------------------------------------------------|----------------------------|
+| `type`                                      | string         | Type of the job configuration                            | `"sparkJobConfiguration"`  |
+| `appPath`                                   | string         | Project path to spark program (e.g `Resources/foo.jar`)  | `null`                     |
+| `mainClass`                                 | string         | Name of the main class to run (e.g `org.company.Main`)   | `null`                     |
+| `environmentName`                           | string         | Name of the project spark environment                    | `"spark-feature-pipeline"` |
+| `spark.driver.cores`                        | number (float) | Number of CPU cores allocated for the driver             | `1.0`                      |
+| `spark.driver.memory`                       | number (int)   | Memory allocated for the driver (in MB)                  | `2048`                     |
+| `spark.executor.instances`                  | number (int)   | Number of executor instances                             | `1`                        |
+| `spark.executor.cores`                      | number (float) | Number of CPU cores per executor                         | `1.0`                      |
+| `spark.executor.memory`                     | number (int)   | Memory allocated per executor (in MB)                    | `4096`                     |
+| `spark.dynamicAllocation.enabled`           | boolean        | Enable dynamic allocation of executors                   | `true`                     |
+| `spark.dynamicAllocation.minExecutors`      | number (int)   | Minimum number of executors with dynamic allocation      | `1`                        |
+| `spark.dynamicAllocation.maxExecutors`      | number (int)   | Maximum number of executors with dynamic allocation      | `2`                        |
+| `spark.dynamicAllocation.initialExecutors`  | number (int)   | Initial number of executors with dynamic allocation      | `1`                        |
+| `spark.blacklist.enabled`                   | boolean        | Whether executor/node blacklisting is enabled            | `false`                    |
+
+## Accessing project data
+
+### Read directly from the filesystem (recommended)
+
+To read a dataset in your project using Spark, use the full filesystem path where the data is stored. For example, to read a CSV file named `data.csv` located in the `Resources` dataset of a project called `my_project`:
+
+```java
+Dataset<Row> df = spark.read()
+    .option("header", "true")       // CSV has header
+    .option("inferSchema", "true")  // Infer data types
+    .csv("/Projects/my_project/Resources/data.csv");
+
+df.show();
+```
+
+### Additional files
+
+Different file types can be attached to the Spark job and made available in the `/srv/hops/artifacts` folder when the Spark job is started. This configuration is mainly useful when you need to add additional configuration, such as jars that need to be added to the CLASSPATH.
+
+When reading data in your Spark job it is recommended to use the Spark read API as previously demonstrated, since this reads from the filesystem directly, whereas the `Additional files` configuration option downloads the files in their entirety, which is not a scalable option.
+
+## API Reference

[Jobs](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/jobs/)
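
For a Spark (JVM) job, the same configuration payload can be filled in from the Python client before creating the job. A sketch along these lines, where the jar path, main class, and job name are illustrative placeholders rather than values from the commit:

```python
import hopsworks

project = hopsworks.login()
jobs_api = project.get_job_api()

# Default SPARK configuration, with the fields from the table above
spark_config = jobs_api.get_configuration("SPARK")
spark_config["appPath"] = "Resources/my-spark-app.jar"   # illustrative jar path
spark_config["mainClass"] = "org.company.Main"           # example class from the table
spark_config["spark.executor.instances"] = 2

job = jobs_api.create_job("spark_job_example", spark_config)  # illustrative job name
execution = job.run(await_termination=True)
```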

docs/user_guides/projects/jupyter/python_notebook.md

Lines changed: 12 additions & 1 deletion
@@ -5,7 +5,7 @@
Jupyter is provided as a service in Hopsworks, providing the same user experience and features as if run on your laptop.

* Supports JupyterLab and the classic Jupyter front-end
-* Configured with Python and PySpark kernels
+* Configured with Python3, PySpark and Ray kernels

## Step 1: Jupyter dashboard

@@ -82,6 +82,17 @@ Start the Jupyter instance by clicking the `Run Jupyter` button.
</figure>
</p>

+## Accessing project data
+!!! notice "Recommended approach if `/hopsfs` is mounted"
+    If your Hopsworks installation is configured to mount the project datasets under `/hopsfs`, which it is in most cases, then please refer to this section.
+    If the file system is not mounted, then project files can be downloaded to the current working directory using the [download api](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/datasets/#download).
+
+### Absolute paths
+The project datasets are mounted under `/hopsfs`, so you can access `data.csv` from the `Resources` dataset using `/hopsfs/Resources/data.csv` in your notebook.
+
+### Relative paths
+The notebook's working directory is the folder it is located in. For example, if it is located in the `Resources` dataset, and you have a file named `data.csv` in that dataset, you simply access it using `data.csv`. Also, if you write a local file, for example `output.txt`, it will be saved in the `Resources` dataset.
+

## Going Further
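
When `/hopsfs` is not mounted, the download API mentioned above can be used from the notebook to pull a file into the working directory first. A minimal sketch, assuming the `hopsworks` client is available and that the dataset API returns the local path of the downloaded file:

```python
import hopsworks
import pandas as pd

project = hopsworks.login()
dataset_api = project.get_dataset_api()

# Download Resources/data.csv into the notebook's working directory
local_path = dataset_api.download("Resources/data.csv", overwrite=True)

df = pd.read_csv(local_path)
df.head()
```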

docs/user_guides/projects/jupyter/ray_notebook.md

Lines changed: 5 additions & 1 deletion
@@ -139,4 +139,8 @@ In the Ray Dashboard, you can monitor the resources used by code you are runnin
<img src="../../../../assets/images/guides/jupyter/ray_jupyter_notebook_session.png" alt="Access Ray Dashboard">
<figcaption>Access Ray Dashboard for Jupyter Ray session</figcaption>
</figure>
-</p>
+</p>
+
+## Accessing project data
+
+The project datasets are mounted under `/home/yarnapp/hopsfs` in the Ray containers, so you can access `data.csv` from the `Resources` dataset using `/home/yarnapp/hopsfs/Resources/data.csv`.
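
In a notebook cell of the Ray session, the mounted path is used directly, for example (assuming `pandas` is installed in the environment):

```python
import pandas as pd

# Project datasets are mounted under /home/yarnapp/hopsfs in the Ray containers
df = pd.read_csv("/home/yarnapp/hopsfs/Resources/data.csv")
df.head()
```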

docs/user_guides/projects/jupyter/spark_notebook.md

Lines changed: 17 additions & 0 deletions
@@ -135,6 +135,23 @@ Navigate back to Hopsworks and a Spark session will have appeared, click on the
</figure>
</p>

+## Accessing project data
+
+### Read directly from the filesystem (recommended)
+
+To read a dataset in your project using Spark, use the full filesystem path where the data is stored. For example, to read a CSV file named `data.csv` located in the `Resources` dataset of a project called `my_project`:
+
+```python
+df = spark.read.csv("/Projects/my_project/Resources/data.csv", header=True, inferSchema=True)
+df.show()
+```
+
+### Additional files
+
+Different files can be attached to the Jupyter session and made available in the `/srv/hops/artifacts` folder when the PySpark kernel is started. This configuration is mainly useful when you need to add additional configuration, such as jars that need to be added to the CLASSPATH.
+
+When reading data in your Spark application, it is recommended to use the Spark read API as previously demonstrated, since this reads from the filesystem directly, whereas the `Additional files` configuration option downloads the files in their entirety, which is not a scalable option.
+
## Going Further

You can learn how to [install a library](../python/python_install.md) so that it can be used in a notebook.
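
The same full-path convention applies when writing results back from the notebook's Spark session. A sketch, assuming the `spark` session provided by the PySpark kernel and a project named `my_project` (the output path is illustrative):

```python
# Read as in the snippet above, then write a derived result back to the project
df = spark.read.csv("/Projects/my_project/Resources/data.csv", header=True, inferSchema=True)

# Aggregate on the first column and store the result as Parquet under Resources
summary = df.groupBy(df.columns[0]).count()
summary.write.mode("overwrite").parquet("/Projects/my_project/Resources/data_summary")
```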

mkdocs.yml

Lines changed: 2 additions & 2 deletions
@@ -152,11 +152,11 @@ nav:
- Run Ray Notebook: user_guides/projects/jupyter/ray_notebook.md
- Remote Filesystem Driver: user_guides/projects/jupyter/remote_filesystem_driver.md
- Jobs:
+- Run Python Job: user_guides/projects/jobs/python_job.md
+- Run Jupyter Notebook Job: user_guides/projects/jobs/notebook_job.md
- Run PySpark Job: user_guides/projects/jobs/pyspark_job.md
- Run Spark Job: user_guides/projects/jobs/spark_job.md
-- Run Python Job: user_guides/projects/jobs/python_job.md
- Run Ray Job: user_guides/projects/jobs/ray_job.md
-- Run Jupyter Notebook Job: user_guides/projects/jobs/notebook_job.md
- Scheduling: user_guides/projects/jobs/schedule_job.md
- Kubernetes Scheduling: user_guides/projects/scheduling/kube_scheduler.md
- Airflow: user_guides/projects/airflow/airflow.md
