From 0b2541f209d16e23767582f21b15ef4aba0b826f Mon Sep 17 00:00:00 2001
From: Robin Andersson
Date: Thu, 5 Jun 2025 09:12:27 +0200
Subject: [PATCH] [HWORKS-2190] Describe the job configuration and how to access the filesystem in the docs (#471)

---
 .../user_guides/projects/jobs/notebook_job.md | 32 +++++++++++--
 docs/user_guides/projects/jobs/pyspark_job.md | 44 ++++++++++++++++--
 docs/user_guides/projects/jobs/python_job.md  | 33 ++++++++++++--
 docs/user_guides/projects/jobs/ray_job.md     | 11 +++--
 docs/user_guides/projects/jobs/spark_job.md   | 45 ++++++++++++++++++-
 .../projects/jupyter/python_notebook.md       | 13 +++++-
 .../projects/jupyter/ray_notebook.md          |  6 ++-
 .../projects/jupyter/spark_notebook.md        | 17 +++++++
 mkdocs.yml                                    |  4 +-
 9 files changed, 187 insertions(+), 18 deletions(-)

diff --git a/docs/user_guides/projects/jobs/notebook_job.md b/docs/user_guides/projects/jobs/notebook_job.md
index a17788651..364b5900e 100644
--- a/docs/user_guides/projects/jobs/notebook_job.md
+++ b/docs/user_guides/projects/jobs/notebook_job.md
@@ -82,7 +82,7 @@ It is possible to also set following configuration settings for a `PYTHON` job.
 * `Environment`: The python environment to use
 * `Container memory`: The amount of memory in MB to be allocated to the Jupyter Notebook script
 * `Container cores`: The number of cores to be allocated for the Jupyter Notebook script
-* `Additional files`: List of files that will be locally accessible by the application
+* `Additional files`: List of files that will be locally accessible in the working directory of the application. Recommended only if the project datasets are not mounted under `/hopsfs`.
 
 You can always modify the arguments in the job settings.
@@ -142,7 +142,7 @@ In this snippet we get the `JobsApi` object to get the default job configuration
 
 ```python
 
-jobs_api = project.get_jobs_api()
+jobs_api = project.get_job_api()
 
 notebook_job_config = jobs_api.get_configuration("PYTHON")
 
@@ -166,7 +166,33 @@ In this code snippet, we execute the job with arguments and wait until it reache
 ```python
 execution = job.run(args='-p a 2 -p b 5', await_termination=True)
 ```
 
-### API Reference
+## Configuration
+The following table describes the JSON payload returned by `jobs_api.get_configuration("PYTHON")`.
+
+| Field                   | Type           | Description                                            | Default                      |
+|-------------------------|----------------|--------------------------------------------------------|------------------------------|
+| `type`                  | string         | Type of the job configuration                          | `"pythonJobConfiguration"`   |
+| `appPath`               | string         | Project path to notebook (e.g. `Resources/foo.ipynb`)  | `null`                       |
+| `environmentName`       | string         | Name of the python environment                         | `"pandas-training-pipeline"` |
+| `resourceConfig.cores`  | number (float) | Number of CPU cores to be allocated                    | `1.0`                        |
+| `resourceConfig.memory` | number (int)   | Number of MBs to be allocated                          | `2048`                       |
+| `resourceConfig.gpus`   | number (int)   | Number of GPUs to be allocated                         | `0`                          |
+| `logRedirection`        | boolean        | Whether logs are redirected                            | `true`                       |
+| `jobType`               | string         | Type of job                                            | `"PYTHON"`                   |
+
+
+## Accessing project data
+!!! notice "Recommended approach if `/hopsfs` is mounted"
+    If your Hopsworks installation is configured to mount the project datasets under `/hopsfs`, which it is in most cases, then please refer to this section for referencing file resources instead of using the `Additional files` property.
+
+### Absolute paths
+The project datasets are mounted under `/hopsfs`, so you can access `data.csv` from the `Resources` dataset using `/hopsfs/Resources/data.csv` in your notebook.
+
+### Relative paths
+The notebook's working directory is the folder it is located in. For example, if it is located in the `Resources` dataset, and you have a file named `data.csv` in that dataset, you simply access it using `data.csv`. Also, if you write a local file, for example `output.txt`, it will be saved in the `Resources` dataset.
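+
+For example, assuming the notebook is located in the `Resources` dataset, that a file named `data.csv` exists in the same dataset, and that `pandas` is available in the selected Python environment, a minimal sketch looks like this:
+
+```python
+import pandas as pd
+
+# Absolute path through the /hopsfs mount
+df = pd.read_csv("/hopsfs/Resources/data.csv")
+
+# Relative path, resolved against the notebook's working directory
+df = pd.read_csv("data.csv")
+
+# Writing a local file, e.g. output.txt, stores it back in the same dataset
+df.describe().to_csv("output.txt")
+```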
+
+
+## API Reference
 
 [Jobs](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/jobs/)
diff --git a/docs/user_guides/projects/jobs/pyspark_job.md b/docs/user_guides/projects/jobs/pyspark_job.md
index 3cc9e3030..c0cb7e804 100644
--- a/docs/user_guides/projects/jobs/pyspark_job.md
+++ b/docs/user_guides/projects/jobs/pyspark_job.md
@@ -8,7 +8,7 @@ description: Documentation on how to configure and execute a PySpark job on Hops
 
 All members of a project in Hopsworks can launch the following types of applications through a project's Jobs service:
 
-- Python (*Hopsworks Enterprise only*)
+- Python
 - Apache Spark
 
 Launching a job of any type is very similar process, what mostly differs between job types is
@@ -179,7 +179,7 @@ In this snippet we get the `JobsApi` object to get the default job configuration
 
 ```python
 
-jobs_api = project.get_jobs_api()
+jobs_api = project.get_job_api()
 
 spark_config = jobs_api.get_configuration("PYSPARK")
 
@@ -211,7 +211,45 @@ print(f_err.read())
 ```
 
-### API Reference
+## Configuration
+The following table describes the JSON payload returned by `jobs_api.get_configuration("PYSPARK")`.
+
+| Field                                      | Type           | Description                                          | Default                    |
+|--------------------------------------------|----------------|------------------------------------------------------|----------------------------|
+| `type`                                     | string         | Type of the job configuration                        | `"sparkJobConfiguration"`  |
+| `appPath`                                  | string         | Project path to script (e.g. `Resources/foo.py`)     | `null`                     |
+| `environmentName`                          | string         | Name of the project spark environment                | `"spark-feature-pipeline"` |
+| `spark.driver.cores`                       | number (float) | Number of CPU cores allocated for the driver         | `1.0`                      |
+| `spark.driver.memory`                      | number (int)   | Memory allocated for the driver (in MB)              | `2048`                     |
+| `spark.executor.instances`                 | number (int)   | Number of executor instances                         | `1`                        |
+| `spark.executor.cores`                     | number (float) | Number of CPU cores per executor                     | `1.0`                      |
+| `spark.executor.memory`                    | number (int)   | Memory allocated per executor (in MB)                | `4096`                     |
+| `spark.dynamicAllocation.enabled`          | boolean        | Enable dynamic allocation of executors               | `true`                     |
+| `spark.dynamicAllocation.minExecutors`     | number (int)   | Minimum number of executors with dynamic allocation  | `1`                        |
+| `spark.dynamicAllocation.maxExecutors`     | number (int)   | Maximum number of executors with dynamic allocation  | `2`                        |
+| `spark.dynamicAllocation.initialExecutors` | number (int)   | Initial number of executors with dynamic allocation  | `1`                        |
+| `spark.blacklist.enabled`                  | boolean        | Whether executor/node blacklisting is enabled        | `false`                    |
+
+
+## Accessing project data
+
+### Read directly from the filesystem (recommended)
+
+To read a dataset in your project using Spark, use the full filesystem path where the data is stored. For example, to read a CSV file named `data.csv` located in the `Resources` dataset of a project called `my_project`:
+
+```python
+df = spark.read.csv("/Projects/my_project/Resources/data.csv", header=True, inferSchema=True)
+df.show()
+```
+
+### Additional files
+
+Different file types can be attached to the Spark job and made available in the `/srv/hops/artifacts` folder when the PySpark job is started. This configuration is mainly useful when you need to add additional setup, such as jars that need to be added to the CLASSPATH.
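+
+As a sketch, and assuming a file named `settings.json` was attached under `Additional files` (the file name is only an illustration), it could be read from the artifacts folder like this:
+
+```python
+import json
+import os
+
+# Files attached under `Additional files` are made available in /srv/hops/artifacts
+settings_path = os.path.join("/srv/hops/artifacts", "settings.json")
+
+with open(settings_path) as f:
+    settings = json.load(f)
+```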
+
+When reading data in your Spark job it is recommended to use the Spark read API as previously demonstrated, since this reads from the filesystem directly, whereas the `Additional files` option will download the files in their entirety, which is not a scalable option.
+
+
+## API Reference
 
 [Jobs](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/jobs/)
diff --git a/docs/user_guides/projects/jobs/python_job.md b/docs/user_guides/projects/jobs/python_job.md
index 4fa58cfa6..420e38e49 100644
--- a/docs/user_guides/projects/jobs/python_job.md
+++ b/docs/user_guides/projects/jobs/python_job.md
@@ -81,7 +81,8 @@ It is possible to also set following configuration settings for a `PYTHON` job.
 * `Environment`: The python environment to use
 * `Container memory`: The amount of memory in MB to be allocated to the Python script
 * `Container cores`: The number of cores to be allocated for the Python script
-* `Additional files`: List of files that will be locally accessible by the application
+* `Additional files`: List of files that will be locally accessible in the working directory of the application. Recommended only if the project datasets are not mounted under `/hopsfs`.
+
 
 You can always modify the arguments in the job settings.
 
@@ -129,7 +130,7 @@ In this snippet we get the `JobsApi` object to get the default job configuration
 
 ```python
 
-jobs_api = project.get_jobs_api()
+jobs_api = project.get_job_api()
 
 py_job_config = jobs_api.get_configuration("PYTHON")
 
@@ -163,7 +164,33 @@ print(f_err.read())
 ```
 
-### API Reference
+## Configuration
+The following table describes the JSON payload returned by `jobs_api.get_configuration("PYTHON")`.
+
+| Field                   | Type           | Description                                        | Default                      |
+|-------------------------|----------------|----------------------------------------------------|------------------------------|
+| `type`                  | string         | Type of the job configuration                      | `"pythonJobConfiguration"`   |
+| `appPath`               | string         | Project path to script (e.g. `Resources/foo.py`)   | `null`                       |
+| `environmentName`       | string         | Name of the project python environment             | `"pandas-training-pipeline"` |
+| `resourceConfig.cores`  | number (float) | Number of CPU cores to be allocated                | `1.0`                        |
+| `resourceConfig.memory` | number (int)   | Number of MBs to be allocated                      | `2048`                       |
+| `resourceConfig.gpus`   | number (int)   | Number of GPUs to be allocated                     | `0`                          |
+| `logRedirection`        | boolean        | Whether logs are redirected                        | `true`                       |
+| `jobType`               | string         | Type of job                                        | `"PYTHON"`                   |
+
+
+## Accessing project data
+!!! notice "Recommended approach if `/hopsfs` is mounted"
+    If your Hopsworks installation is configured to mount the project datasets under `/hopsfs`, which it is in most cases, then please refer to this section for referencing file resources instead of using the `Additional files` property.
+
+### Absolute paths
+The project datasets are mounted under `/hopsfs`, so you can access `data.csv` from the `Resources` dataset using `/hopsfs/Resources/data.csv` in your script.
+
+### Relative paths
+The script's working directory is the folder it is located in. For example, if it is located in the `Resources` dataset, and you have a file named `data.csv` in that dataset, you simply access it using `data.csv`. Also, if you write a local file, for example `output.txt`, it will be saved in the `Resources` dataset.
+
+
+## API Reference
 
 [Jobs](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/jobs/)
diff --git a/docs/user_guides/projects/jobs/ray_job.md b/docs/user_guides/projects/jobs/ray_job.md
index 99312f4a2..1b79a6f49 100644
--- a/docs/user_guides/projects/jobs/ray_job.md
+++ b/docs/user_guides/projects/jobs/ray_job.md
@@ -8,7 +8,7 @@ description: Documentation on how to configure and execute a Ray job on Hopswork
 
 All members of a project in Hopsworks can launch the following types of applications through a project's Jobs service:
 
-- Python (*Hopsworks Enterprise only*)
+- Python
 - Apache Spark
 - Ray
 
@@ -168,7 +168,7 @@ In this snippet we get the `JobsApi` object to get the default job configuration
 
 ```python
 
-jobs_api = project.get_jobs_api()
+jobs_api = project.get_job_api()
 
 ray_config = jobs_api.get_configuration("RAY")
 
@@ -203,7 +203,12 @@ print(f_err.read())
 ```
 
-### API Reference
+## Accessing project data
+
+The project datasets are mounted under `/home/yarnapp/hopsfs`, so you can access `data.csv` from the `Resources` dataset using `/home/yarnapp/hopsfs/Resources/data.csv` in your script.
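+
+For example, assuming `pandas` is available in the job's environment and `data.csv` exists in the `Resources` dataset, a minimal sketch looks like this:
+
+```python
+import pandas as pd
+
+# The project filesystem is mounted under /home/yarnapp/hopsfs in the Ray containers
+df = pd.read_csv("/home/yarnapp/hopsfs/Resources/data.csv")
+print(df.head())
+```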
+
+
+## API Reference
 
 [Jobs](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/jobs/)
diff --git a/docs/user_guides/projects/jobs/spark_job.md b/docs/user_guides/projects/jobs/spark_job.md
index 66be8c001..6d0f0510b 100644
--- a/docs/user_guides/projects/jobs/spark_job.md
+++ b/docs/user_guides/projects/jobs/spark_job.md
@@ -183,7 +183,7 @@ In this snippet we get the `JobsApi` object to get the default job configuration
 
 ```python
 
-jobs_api = project.get_jobs_api()
+jobs_api = project.get_job_api()
 
 spark_config = jobs_api.get_configuration("SPARK")
 
@@ -212,7 +212,48 @@ print(f_err.read())
 ```
 
-### API Reference
+## Configuration
+The following table describes the JSON payload returned by `jobs_api.get_configuration("SPARK")`.
+
+| Field                                      | Type           | Description                                               | Default                    |
+|--------------------------------------------|----------------|-----------------------------------------------------------|----------------------------|
+| `type`                                     | string         | Type of the job configuration                             | `"sparkJobConfiguration"`  |
+| `appPath`                                  | string         | Project path to spark program (e.g. `Resources/foo.jar`)  | `null`                     |
+| `mainClass`                                | string         | Name of the main class to run (e.g. `org.company.Main`)   | `null`                     |
+| `environmentName`                          | string         | Name of the project spark environment                     | `"spark-feature-pipeline"` |
+| `spark.driver.cores`                       | number (float) | Number of CPU cores allocated for the driver              | `1.0`                      |
+| `spark.driver.memory`                      | number (int)   | Memory allocated for the driver (in MB)                   | `2048`                     |
+| `spark.executor.instances`                 | number (int)   | Number of executor instances                              | `1`                        |
+| `spark.executor.cores`                     | number (float) | Number of CPU cores per executor                          | `1.0`                      |
+| `spark.executor.memory`                    | number (int)   | Memory allocated per executor (in MB)                     | `4096`                     |
+| `spark.dynamicAllocation.enabled`          | boolean        | Enable dynamic allocation of executors                    | `true`                     |
+| `spark.dynamicAllocation.minExecutors`     | number (int)   | Minimum number of executors with dynamic allocation       | `1`                        |
+| `spark.dynamicAllocation.maxExecutors`     | number (int)   | Maximum number of executors with dynamic allocation       | `2`                        |
+| `spark.dynamicAllocation.initialExecutors` | number (int)   | Initial number of executors with dynamic allocation       | `1`                        |
+| `spark.blacklist.enabled`                  | boolean        | Whether executor/node blacklisting is enabled             | `false`                    |
+
+## Accessing project data
+
+### Read directly from the filesystem (recommended)
+
+To read a dataset in your project using Spark, use the full filesystem path where the data is stored. For example, to read a CSV file named `data.csv` located in the `Resources` dataset of a project called `my_project`:
+
+```java
+Dataset<Row> df = spark.read()
+    .option("header", "true")       // CSV has header
+    .option("inferSchema", "true")  // Infer data types
+    .csv("/Projects/my_project/Resources/data.csv");
+
+df.show();
+```
+
+### Additional files
+
+Different file types can be attached to the Spark job and made available in the `/srv/hops/artifacts` folder when the Spark job is started. This configuration is mainly useful when you need to add additional configuration such as jars that need to be added to the CLASSPATH.
+
+When reading data in your Spark job it is recommended to use the Spark read API as previously demonstrated, since this reads from the filesystem directly, whereas the `Additional files` option will download the files in their entirety, which is not a scalable option.
+
+## API Reference
 
 [Jobs](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/jobs/)
diff --git a/docs/user_guides/projects/jupyter/python_notebook.md b/docs/user_guides/projects/jupyter/python_notebook.md
index 3412a0d96..409faa6d5 100644
--- a/docs/user_guides/projects/jupyter/python_notebook.md
+++ b/docs/user_guides/projects/jupyter/python_notebook.md
@@ -5,7 +5,7 @@
 Jupyter is provided as a service in Hopsworks, providing the same user experience and features as if run on your laptop.
 
 * Supports JupyterLab and the classic Jupyter front-end
-* Configured with Python and PySpark kernels
+* Configured with Python3, PySpark and Ray kernels
 
 ## Step 1: Jupyter dashboard
 
@@ -82,6 +82,17 @@ Start the Jupyter instance by clicking the `Run Jupyter` button.
 
+## Accessing project data
+!!! notice "Recommended approach if `/hopsfs` is mounted"
+    If your Hopsworks installation is configured to mount the project datasets under `/hopsfs`, which it is in most cases, then please refer to this section.
+    If the filesystem is not mounted, project files can instead be localized into the current working directory using the [download api](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/datasets/#download).
+
+### Absolute paths
+The project datasets are mounted under `/hopsfs`, so you can access `data.csv` from the `Resources` dataset using `/hopsfs/Resources/data.csv` in your notebook.
+
+### Relative paths
+The notebook's working directory is the folder it is located in. For example, if it is located in the `Resources` dataset, and you have a file named `data.csv` in that dataset, you simply access it using `data.csv`. Also, if you write a local file, for example `output.txt`, it will be saved in the `Resources` dataset.
+
 
 ## Going Further
 
diff --git a/docs/user_guides/projects/jupyter/ray_notebook.md b/docs/user_guides/projects/jupyter/ray_notebook.md
index d6d4eae3e..f008583e1 100644
--- a/docs/user_guides/projects/jupyter/ray_notebook.md
+++ b/docs/user_guides/projects/jupyter/ray_notebook.md
@@ -139,4 +139,8 @@ In the Ray Dashboard, you can monitor the resources used by code you are runnin
 
 Access Ray Dashboard
 Access Ray Dashboard for Jupyter Ray session
-
\ No newline at end of file
+
+
+## Accessing project data
+
+The project datasets are mounted under `/home/yarnapp/hopsfs` in the Ray containers, so you can access `data.csv` from the `Resources` dataset using `/home/yarnapp/hopsfs/Resources/data.csv`.
diff --git a/docs/user_guides/projects/jupyter/spark_notebook.md b/docs/user_guides/projects/jupyter/spark_notebook.md
index c358bee61..689df54ba 100644
--- a/docs/user_guides/projects/jupyter/spark_notebook.md
+++ b/docs/user_guides/projects/jupyter/spark_notebook.md
@@ -135,6 +135,23 @@ Navigate back to Hopsworks and a Spark session will have appeared, click on the
 
+## Accessing project data
+
+### Read directly from the filesystem (recommended)
+
+To read a dataset in your project using Spark, use the full filesystem path where the data is stored. For example, to read a CSV file named `data.csv` located in the `Resources` dataset of a project called `my_project`:
+
+```python
+df = spark.read.csv("/Projects/my_project/Resources/data.csv", header=True, inferSchema=True)
+df.show()
+```
+
+### Additional files
+
+Different files can be attached to the Jupyter session and made available in the `/srv/hops/artifacts` folder when the PySpark kernel is started. This configuration is mainly useful when you need to add additional configuration such as jars that need to be added to the CLASSPATH.
+
+When reading data in your Spark application, it is recommended to use the Spark read API as previously demonstrated, since this reads from the filesystem directly, whereas the `Additional files` option will download the files in their entirety, which is not a scalable option.
+
 ## Going Further
 
 You can learn how to [install a library](../python/python_install.md) so that it can be used in a notebook.
diff --git a/mkdocs.yml b/mkdocs.yml
index 042c13b34..7bb71e1d9 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -152,11 +152,11 @@ nav:
       - Run Ray Notebook: user_guides/projects/jupyter/ray_notebook.md
       - Remote Filesystem Driver: user_guides/projects/jupyter/remote_filesystem_driver.md
     - Jobs:
+      - Run Python Job: user_guides/projects/jobs/python_job.md
+      - Run Jupyter Notebook Job: user_guides/projects/jobs/notebook_job.md
       - Run PySpark Job: user_guides/projects/jobs/pyspark_job.md
       - Run Spark Job: user_guides/projects/jobs/spark_job.md
-      - Run Python Job: user_guides/projects/jobs/python_job.md
       - Run Ray Job: user_guides/projects/jobs/ray_job.md
-      - Run Jupyter Notebook Job: user_guides/projects/jobs/notebook_job.md
       - Scheduling: user_guides/projects/jobs/schedule_job.md
       - Kubernetes Scheduling: user_guides/projects/scheduling/kube_scheduler.md
       - Airflow: user_guides/projects/airflow/airflow.md