
Commit 410a3a7

[HWORKS-2190] Describe the job configuration and how to access the filesystem in the docs (#471) (#475)
1 parent a1c84dc commit 410a3a7

9 files changed: +187 −18 lines changed


docs/user_guides/projects/jobs/notebook_job.md

Lines changed: 29 additions & 3 deletions
@@ -82,7 +82,7 @@ It is possible to also set following configuration settings for a `PYTHON` job.
* `Environment`: The python environment to use
* `Container memory`: The amount of memory in MB to be allocated to the Jupyter Notebook script
* `Container cores`: The number of cores to be allocated for the Jupyter Notebook script
-* `Additional files`: List of files that will be locally accessible by the application
+* `Additional files`: List of files that will be locally accessible in the working directory of the application. Only recommended to use if project datasets are not mounted under `/hopsfs`.
You can always modify the arguments in the job settings.

<p align="center">
@@ -142,7 +142,7 @@ In this snippet we get the `JobsApi` object to get the default job configuration

```python

-jobs_api = project.get_jobs_api()
+jobs_api = project.get_job_api()

notebook_job_config = jobs_api.get_configuration("PYTHON")

@@ -166,7 +166,33 @@ In this code snippet, we execute the job with arguments and wait until it reache
execution = job.run(args='-p a 2 -p b 5', await_termination=True)
```

-### API Reference
+## Configuration
+The following table describes the JSON payload returned by `jobs_api.get_configuration("PYTHON")`
+
+| Field                   | Type           | Description                                           | Default                      |
+|-------------------------|----------------|-------------------------------------------------------|------------------------------|
+| `type`                  | string         | Type of the job configuration                         | `"pythonJobConfiguration"`   |
+| `appPath`               | string         | Project path to notebook (e.g `Resources/foo.ipynb`)  | `null`                       |
+| `environmentName`       | string         | Name of the python environment                        | `"pandas-training-pipeline"` |
+| `resourceConfig.cores`  | number (float) | Number of CPU cores to be allocated                   | `1.0`                        |
+| `resourceConfig.memory` | number (int)   | Number of MBs to be allocated                         | `2048`                       |
+| `resourceConfig.gpus`   | number (int)   | Number of GPUs to be allocated                        | `0`                          |
+| `logRedirection`        | boolean        | Whether logs are redirected                           | `true`                       |
+| `jobType`               | string         | Type of job                                           | `"PYTHON"`                   |
+
+
+## Accessing project data
+!!! notice "Recommended approach if `/hopsfs` is mounted"
+    If your Hopsworks installation is configured to mount the project datasets under `/hopsfs`, which it is in most cases, then please refer to this section instead of the `Additional files` property to reference file resources.
+
+### Absolute paths
+The project datasets are mounted under `/hopsfs`, so you can access `data.csv` from the `Resources` dataset using `/hopsfs/Resources/data.csv` in your notebook.
+
+### Relative paths
+The notebook's working directory is the folder it is located in. For example, if it is located in the `Resources` dataset, and you have a file named `data.csv` in that dataset, you simply access it using `data.csv`. Also, if you write a local file, for example `output.txt`, it will be saved in the `Resources` dataset.
+
+
+## API Reference

[Jobs](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/jobs/)
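
Putting the updated snippets together, the end-to-end flow for a notebook job would look roughly like the sketch below. The notebook path, job name, and the treatment of the returned configuration as a nested dictionary (following the field names in the table above) are illustrative assumptions, not values from the commit.

```python
import hopsworks

# Log in to the Hopsworks cluster and get the job API, as in the updated snippet above
project = hopsworks.login()
jobs_api = project.get_job_api()

# Default PYTHON configuration, with the fields listed in the table above
notebook_job_config = jobs_api.get_configuration("PYTHON")
notebook_job_config["appPath"] = "Resources/notebook.ipynb"   # illustrative notebook path
notebook_job_config["resourceConfig"]["memory"] = 2048        # container memory in MB
notebook_job_config["resourceConfig"]["cores"] = 1.0          # container cores

# Create the job and run it with notebook parameters, waiting for a terminal state
job = jobs_api.create_job("notebook_job_example", notebook_job_config)
execution = job.run(args='-p a 2 -p b 5', await_termination=True)
```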

docs/user_guides/projects/jobs/pyspark_job.md

Lines changed: 41 additions & 3 deletions
@@ -8,7 +8,7 @@ description: Documentation on how to configure and execute a PySpark job on Hops

All members of a project in Hopsworks can launch the following types of applications through a project's Jobs service:

-- Python (*Hopsworks Enterprise only*)
+- Python
- Apache Spark

Launching a job of any type is very similar process, what mostly differs between job types is
@@ -179,7 +179,7 @@ In this snippet we get the `JobsApi` object to get the default job configuration

```python

-jobs_api = project.get_jobs_api()
+jobs_api = project.get_job_api()

spark_config = jobs_api.get_configuration("PYSPARK")

@@ -211,7 +211,45 @@ print(f_err.read())

```

-### API Reference
+## Configuration
+The following table describes the JSON payload returned by `jobs_api.get_configuration("PYSPARK")`
+
+| Field                                       | Type           | Description                                          | Default                    |
+|---------------------------------------------|----------------|------------------------------------------------------|----------------------------|
+| `type`                                      | string         | Type of the job configuration                        | `"sparkJobConfiguration"`  |
+| `appPath`                                   | string         | Project path to script (e.g `Resources/foo.py`)      | `null`                     |
+| `environmentName`                           | string         | Name of the project spark environment                | `"spark-feature-pipeline"` |
+| `spark.driver.cores`                        | number (float) | Number of CPU cores allocated for the driver         | `1.0`                      |
+| `spark.driver.memory`                       | number (int)   | Memory allocated for the driver (in MB)              | `2048`                     |
+| `spark.executor.instances`                  | number (int)   | Number of executor instances                         | `1`                        |
+| `spark.executor.cores`                      | number (float) | Number of CPU cores per executor                     | `1.0`                      |
+| `spark.executor.memory`                     | number (int)   | Memory allocated per executor (in MB)                | `4096`                     |
+| `spark.dynamicAllocation.enabled`           | boolean        | Enable dynamic allocation of executors               | `true`                     |
+| `spark.dynamicAllocation.minExecutors`      | number (int)   | Minimum number of executors with dynamic allocation  | `1`                        |
+| `spark.dynamicAllocation.maxExecutors`      | number (int)   | Maximum number of executors with dynamic allocation  | `2`                        |
+| `spark.dynamicAllocation.initialExecutors`  | number (int)   | Initial number of executors with dynamic allocation  | `1`                        |
+| `spark.blacklist.enabled`                   | boolean        | Whether executor/node blacklisting is enabled        | `false`                    |
+
+
+## Accessing project data
+
+### Read directly from the filesystem (recommended)
+
+To read a dataset in your project using Spark, use the full filesystem path where the data is stored. For example, to read a CSV file named `data.csv` located in the `Resources` dataset of a project called `my_project`:
+
+```python
+df = spark.read.csv("/Projects/my_project/Resources/data.csv", header=True, inferSchema=True)
+df.show()
+```
+
+### Additional files
+
+Different file types can be attached to the Spark job and made available in the `/srv/hops/artifacts` folder when the PySpark job is started. This configuration is mainly useful when you need to add additional setup, such as jars that need to be added to the CLASSPATH.
+
+When reading data in your Spark job it is recommended to use the Spark read API as previously demonstrated, since this reads from the filesystem directly, whereas the `Additional files` configuration option downloads the files in their entirety, which is not a scalable option.
+
+
+## API Reference

[Jobs](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/jobs/)
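
As a rough illustration of how the documented fields can be combined from the Python client, the sketch below adjusts a few of the defaults before creating a job. The script path and job name are illustrative, and it assumes the dotted Spark keys appear as flat dictionary keys in the payload, as the table suggests.

```python
import hopsworks

project = hopsworks.login()
jobs_api = project.get_job_api()

# Default PYSPARK configuration, with the fields from the table above
spark_config = jobs_api.get_configuration("PYSPARK")

# Adjust a few documented fields (values are only an example)
spark_config["appPath"] = "Resources/feature_pipeline.py"       # illustrative script path
spark_config["spark.executor.memory"] = 4096                    # MB per executor
spark_config["spark.dynamicAllocation.maxExecutors"] = 4

job = jobs_api.create_job("pyspark_job_example", spark_config)  # illustrative job name
execution = job.run(await_termination=True)
```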

docs/user_guides/projects/jobs/python_job.md

Lines changed: 30 additions & 3 deletions
@@ -81,7 +81,8 @@ It is possible to also set following configuration settings for a `PYTHON` job.
* `Environment`: The python environment to use
* `Container memory`: The amount of memory in MB to be allocated to the Python script
* `Container cores`: The number of cores to be allocated for the Python script
-* `Additional files`: List of files that will be locally accessible by the application
+* `Additional files`: List of files that will be locally accessible in the working directory of the application. Only recommended to use if project datasets are not mounted under `/hopsfs`.
+You can always modify the arguments in the job settings.

<p align="center">
<figure>
@@ -129,7 +130,7 @@ In this snippet we get the `JobsApi` object to get the default job configuration

```python

-jobs_api = project.get_jobs_api()
+jobs_api = project.get_job_api()

py_job_config = jobs_api.get_configuration("PYTHON")

@@ -163,7 +164,33 @@ print(f_err.read())

```

-### API Reference
+## Configuration
+The following table describes the JSON payload returned by `jobs_api.get_configuration("PYTHON")`
+
+| Field                   | Type           | Description                                      | Default                      |
+|-------------------------|----------------|--------------------------------------------------|------------------------------|
+| `type`                  | string         | Type of the job configuration                    | `"pythonJobConfiguration"`   |
+| `appPath`               | string         | Project path to script (e.g `Resources/foo.py`)  | `null`                       |
+| `environmentName`       | string         | Name of the project python environment           | `"pandas-training-pipeline"` |
+| `resourceConfig.cores`  | number (float) | Number of CPU cores to be allocated              | `1.0`                        |
+| `resourceConfig.memory` | number (int)   | Number of MBs to be allocated                    | `2048`                       |
+| `resourceConfig.gpus`   | number (int)   | Number of GPUs to be allocated                   | `0`                          |
+| `logRedirection`        | boolean        | Whether logs are redirected                      | `true`                       |
+| `jobType`               | string         | Type of job                                      | `"PYTHON"`                   |
+
+
+## Accessing project data
+!!! notice "Recommended approach if `/hopsfs` is mounted"
+    If your Hopsworks installation is configured to mount the project datasets under `/hopsfs`, which it is in most cases, then please refer to this section instead of the `Additional files` property to reference file resources.
+
+### Absolute paths
+The project datasets are mounted under `/hopsfs`, so you can access `data.csv` from the `Resources` dataset using `/hopsfs/Resources/data.csv` in your script.
+
+### Relative paths
+The script's working directory is the folder it is located in. For example, if it is located in the `Resources` dataset, and you have a file named `data.csv` in that dataset, you simply access it using `data.csv`. Also, if you write a local file, for example `output.txt`, it will be saved in the `Resources` dataset.
+
+
+## API Reference

[Jobs](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/jobs/)
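
Inside the Python job itself, the absolute and relative path rules described above translate into ordinary file access. A minimal sketch, assuming the script lives in the `Resources` dataset and that `pandas` is available in the selected environment (file names are illustrative):

```python
# Contents of a script such as Resources/process_data.py (illustrative name)
import pandas as pd

# Absolute path: project datasets are mounted under /hopsfs
df = pd.read_csv("/hopsfs/Resources/data.csv")

# Relative path: the working directory is the folder the script is located in,
# so a file in the same dataset can be read directly
df_local = pd.read_csv("data.csv")

# Writing a local file saves it back to the same dataset (Resources in this example)
df.describe().to_csv("output.txt")
```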

docs/user_guides/projects/jobs/ray_job.md

Lines changed: 8 additions & 3 deletions
@@ -8,7 +8,7 @@ description: Documentation on how to configure and execute a Ray job on Hopswork

All members of a project in Hopsworks can launch the following types of applications through a project's Jobs service:

-- Python (*Hopsworks Enterprise only*)
+- Python
- Apache Spark
- Ray

@@ -168,7 +168,7 @@ In this snippet we get the `JobsApi` object to get the default job configuration

```python

-jobs_api = project.get_jobs_api()
+jobs_api = project.get_job_api()

ray_config = jobs_api.get_configuration("RAY")

@@ -203,7 +203,12 @@ print(f_err.read())

```

-### API Reference
+## Accessing project data
+
+The project datasets are mounted under `/home/yarnapp/hopsfs`, so you can access `data.csv` from the `Resources` dataset using `/home/yarnapp/hopsfs/Resources/data.csv` in your script.
+
+
+## API Reference

[Jobs](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/jobs/)
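
Inside the Ray job script, the mount point above is used like any local path. The sketch below is illustrative only: it assumes `pandas` is available in the job's environment and that `ray.init()` with no arguments attaches to the Ray runtime started for the job, which may differ per installation.

```python
import ray
import pandas as pd

# Attach to the Ray runtime for this job (assumption, see note above)
ray.init()

# Project datasets are mounted under /home/yarnapp/hopsfs in the Ray containers
df = pd.read_csv("/home/yarnapp/hopsfs/Resources/data.csv")

@ray.remote
def count_rows(frame):
    return len(frame)

print(ray.get(count_rows.remote(df)))
```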

docs/user_guides/projects/jobs/spark_job.md

Lines changed: 43 additions & 2 deletions
@@ -183,7 +183,7 @@ In this snippet we get the `JobsApi` object to get the default job configuration

```python

-jobs_api = project.get_jobs_api()
+jobs_api = project.get_job_api()

spark_config = jobs_api.get_configuration("SPARK")

@@ -212,7 +212,48 @@ print(f_err.read())

```

-### API Reference
+## Configuration
+The following table describes the JSON payload returned by `jobs_api.get_configuration("SPARK")`
+
+| Field                                       | Type           | Description                                              | Default                    |
+|---------------------------------------------|----------------|----------------------------------------------------------|----------------------------|
+| `type`                                      | string         | Type of the job configuration                            | `"sparkJobConfiguration"`  |
+| `appPath`                                   | string         | Project path to spark program (e.g `Resources/foo.jar`)  | `null`                     |
+| `mainClass`                                 | string         | Name of the main class to run (e.g `org.company.Main`)   | `null`                     |
+| `environmentName`                           | string         | Name of the project spark environment                    | `"spark-feature-pipeline"` |
+| `spark.driver.cores`                        | number (float) | Number of CPU cores allocated for the driver             | `1.0`                      |
+| `spark.driver.memory`                       | number (int)   | Memory allocated for the driver (in MB)                  | `2048`                     |
+| `spark.executor.instances`                  | number (int)   | Number of executor instances                             | `1`                        |
+| `spark.executor.cores`                      | number (float) | Number of CPU cores per executor                         | `1.0`                      |
+| `spark.executor.memory`                     | number (int)   | Memory allocated per executor (in MB)                    | `4096`                     |
+| `spark.dynamicAllocation.enabled`           | boolean        | Enable dynamic allocation of executors                   | `true`                     |
+| `spark.dynamicAllocation.minExecutors`      | number (int)   | Minimum number of executors with dynamic allocation      | `1`                        |
+| `spark.dynamicAllocation.maxExecutors`      | number (int)   | Maximum number of executors with dynamic allocation      | `2`                        |
+| `spark.dynamicAllocation.initialExecutors`  | number (int)   | Initial number of executors with dynamic allocation      | `1`                        |
+| `spark.blacklist.enabled`                   | boolean        | Whether executor/node blacklisting is enabled            | `false`                    |
+
+## Accessing project data
+
+### Read directly from the filesystem (recommended)
+
+To read a dataset in your project using Spark, use the full filesystem path where the data is stored. For example, to read a CSV file named `data.csv` located in the `Resources` dataset of a project called `my_project`:
+
+```java
+Dataset<Row> df = spark.read()
+    .option("header", "true")       // CSV has header
+    .option("inferSchema", "true")  // Infer data types
+    .csv("/Projects/my_project/Resources/data.csv");
+
+df.show();
+```
+
+### Additional files
+
+Different file types can be attached to the Spark job and made available in the `/srv/hops/artifacts` folder when the Spark job is started. This configuration is mainly useful when you need to add additional configuration, such as jars that need to be added to the CLASSPATH.
+
+When reading data in your Spark job it is recommended to use the Spark read API as previously demonstrated, since this reads from the filesystem directly, whereas the `Additional files` configuration option downloads the files in their entirety, which is not a scalable option.
+
+## API Reference

[Jobs](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/jobs/)
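
For a Spark (JVM) job, the same configuration payload can be filled in from the Python client before creating the job. A sketch along these lines, where the jar path, main class, and job name are illustrative placeholders rather than values from the commit:

```python
import hopsworks

project = hopsworks.login()
jobs_api = project.get_job_api()

# Default SPARK configuration, with the fields from the table above
spark_config = jobs_api.get_configuration("SPARK")
spark_config["appPath"] = "Resources/my-spark-app.jar"   # illustrative jar path
spark_config["mainClass"] = "org.company.Main"           # example class from the table
spark_config["spark.executor.instances"] = 2

job = jobs_api.create_job("spark_job_example", spark_config)  # illustrative job name
execution = job.run(await_termination=True)
```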

docs/user_guides/projects/jupyter/python_notebook.md

Lines changed: 12 additions & 1 deletion
@@ -5,7 +5,7 @@
Jupyter is provided as a service in Hopsworks, providing the same user experience and features as if run on your laptop.

* Supports JupyterLab and the classic Jupyter front-end
-* Configured with Python and PySpark kernels
+* Configured with Python3, PySpark and Ray kernels

## Step 1: Jupyter dashboard

@@ -82,6 +82,17 @@ Start the Jupyter instance by clicking the `Run Jupyter` button.
</figure>
</p>

+## Accessing project data
+!!! notice "Recommended approach if `/hopsfs` is mounted"
+    If your Hopsworks installation is configured to mount the project datasets under `/hopsfs`, which it is in most cases, then please refer to this section.
+    If the file system is not mounted, then project files can be downloaded to the current working directory using the [download api](https://docs.hopsworks.ai/hopsworks-api/{{{ hopsworks_version }}}/generated/api/datasets/#download).
+
+### Absolute paths
+The project datasets are mounted under `/hopsfs`, so you can access `data.csv` from the `Resources` dataset using `/hopsfs/Resources/data.csv` in your notebook.
+
+### Relative paths
+The notebook's working directory is the folder it is located in. For example, if it is located in the `Resources` dataset, and you have a file named `data.csv` in that dataset, you simply access it using `data.csv`. Also, if you write a local file, for example `output.txt`, it will be saved in the `Resources` dataset.
+

## Going Further
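
When `/hopsfs` is not mounted, the download API mentioned above can be used from the notebook to pull a file into the working directory first. A minimal sketch, assuming the `hopsworks` client is available and that the dataset API returns the local path of the downloaded file:

```python
import hopsworks
import pandas as pd

project = hopsworks.login()
dataset_api = project.get_dataset_api()

# Download Resources/data.csv into the notebook's working directory
local_path = dataset_api.download("Resources/data.csv", overwrite=True)

df = pd.read_csv(local_path)
df.head()
```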

docs/user_guides/projects/jupyter/ray_notebook.md

Lines changed: 5 additions & 1 deletion
@@ -139,4 +139,8 @@ In the Ray Dashboard, you can monitor the resources used by code you are runnin
<img src="../../../../assets/images/guides/jupyter/ray_jupyter_notebook_session.png" alt="Access Ray Dashboard">
<figcaption>Access Ray Dashboard for Jupyter Ray session</figcaption>
</figure>
-</p>
+</p>
+
+## Accessing project data
+
+The project datasets are mounted under `/home/yarnapp/hopsfs` in the Ray containers, so you can access `data.csv` from the `Resources` dataset using `/home/yarnapp/hopsfs/Resources/data.csv`.
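
In a notebook cell of the Ray session, the mounted path is used directly, for example (assuming `pandas` is installed in the environment):

```python
import pandas as pd

# Project datasets are mounted under /home/yarnapp/hopsfs in the Ray containers
df = pd.read_csv("/home/yarnapp/hopsfs/Resources/data.csv")
df.head()
```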

docs/user_guides/projects/jupyter/spark_notebook.md

Lines changed: 17 additions & 0 deletions
@@ -135,6 +135,23 @@ Navigate back to Hopsworks and a Spark session will have appeared, click on the
</figure>
</p>

+## Accessing project data
+
+### Read directly from the filesystem (recommended)
+
+To read a dataset in your project using Spark, use the full filesystem path where the data is stored. For example, to read a CSV file named `data.csv` located in the `Resources` dataset of a project called `my_project`:
+
+```python
+df = spark.read.csv("/Projects/my_project/Resources/data.csv", header=True, inferSchema=True)
+df.show()
+```
+
+### Additional files
+
+Different files can be attached to the Jupyter session and made available in the `/srv/hops/artifacts` folder when the PySpark kernel is started. This configuration is mainly useful when you need to add additional configuration, such as jars that need to be added to the CLASSPATH.
+
+When reading data in your Spark application, it is recommended to use the Spark read API as previously demonstrated, since this reads from the filesystem directly, whereas the `Additional files` configuration option downloads the files in their entirety, which is not a scalable option.
+
## Going Further

You can learn how to [install a library](../python/python_install.md) so that it can be used in a notebook.
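
The same full-path convention applies when writing results back from the notebook's Spark session. A sketch, assuming the `spark` session provided by the PySpark kernel and a project named `my_project` (the output path is illustrative):

```python
# Read as in the snippet above, then write a derived result back to the project
df = spark.read.csv("/Projects/my_project/Resources/data.csv", header=True, inferSchema=True)

# Aggregate on the first column and store the result as Parquet under Resources
summary = df.groupBy(df.columns[0]).count()
summary.write.mode("overwrite").parquet("/Projects/my_project/Resources/data_summary")
```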

mkdocs.yml

Lines changed: 2 additions & 2 deletions
@@ -152,11 +152,11 @@ nav:
- Run Ray Notebook: user_guides/projects/jupyter/ray_notebook.md
- Remote Filesystem Driver: user_guides/projects/jupyter/remote_filesystem_driver.md
- Jobs:
+- Run Python Job: user_guides/projects/jobs/python_job.md
+- Run Jupyter Notebook Job: user_guides/projects/jobs/notebook_job.md
- Run PySpark Job: user_guides/projects/jobs/pyspark_job.md
- Run Spark Job: user_guides/projects/jobs/spark_job.md
-- Run Python Job: user_guides/projects/jobs/python_job.md
- Run Ray Job: user_guides/projects/jobs/ray_job.md
-- Run Jupyter Notebook Job: user_guides/projects/jobs/notebook_job.md
- Scheduling: user_guides/projects/jobs/schedule_job.md
- Kubernetes Scheduling: user_guides/projects/scheduling/kube_scheduler.md
- Airflow: user_guides/projects/airflow/airflow.md
