This project demonstrates an Oozie workflow with a PySpark action. It assumes that all the PySpark logic lives in a Python library that only needs a HiveContext and a date to run. The library is distributed to all the workers on the cluster, and a pipeline within it is kicked off daily, once its input data sources are available.
An Oozie workflow with a PySpark action is non-trivial, because:
1. Oozie doesn't have native support for a PySpark action.
2. YARN can kill your Oozie container when running PySpark in a shell.
3. Distributing eggs with PySpark can be a challenge due to problems with the PYTHON_EGG_CACHE (depending on cluster setup).
Issue 1 is solved by calling spark-submit in a shell action; issue 2 by setting the launcher configuration in the workflow definition; and issue 3 by setting the PYTHON_EGG_CACHE for the driver in the calling Python script and for the executors in the spark-submit.
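To make this concrete, here is a minimal sketch of such a shell action in workflow.xml. The action name, file names, and memory values are assumptions of this sketch, not necessarily what this repo uses; the oozie.launcher properties are what address issue 2 by enlarging the launcher container:

```xml
<action name="pyspark-action">
    <shell xmlns="uri:oozie:shell-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <!-- Issue 2: give the Oozie launcher container enough memory,
                 or YARN may kill it while the shell runs spark-submit -->
            <property>
                <name>oozie.launcher.mapreduce.map.memory.mb</name>
                <value>4096</value>
            </property>
            <property>
                <name>oozie.launcher.mapreduce.map.java.opts</name>
                <value>-Xmx3072m</value>
            </property>
        </configuration>
        <exec>submit.sh</exec>
        <argument>${date}</argument>
        <file>bin/submit.sh#submit.sh</file>
        <file>dist/project-1.0-py3.5.egg#project-1.0-py3.5.egg</file>
    </shell>
    <ok to="end"/>
    <error to="fail"/>
</action>
```

The submit script would then cover issues 1 and 3 roughly as follows; the script name, the calling Python script (run_pipeline.py), and the egg-cache path are likewise placeholders:

```bash
#!/usr/bin/env bash
# Issue 1: call spark-submit from the shell action instead of a
# (non-existent) native PySpark action.
# Issue 3: point the executors' PYTHON_EGG_CACHE at a writable directory.
# The driver-side equivalent is an os.environ['PYTHON_EGG_CACHE'] assignment
# at the top of the calling Python script, before the egg is imported.
spark-submit \
    --master yarn \
    --py-files project-1.0-py3.5.egg \
    --conf spark.executorEnv.PYTHON_EGG_CACHE=/tmp/python-eggs \
    run_pipeline.py "$1"
```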
Note: this repo will crash as-is (and this time it's not Oozie's fault!), because it is a stripped/anonymized version of a real workflow. Change references to namenodes, HDFS paths, Oozie URLs, /user/ directories, etc. to match your environment.
Create a Python egg by cd-ing into the Python project directory, building the egg, and copying it to dist/ in the Oozie workflow folder:
$ cd python-project/
$ python setup.py bdist_egg
$ cp dist/project-1.0-py3.5.egg ../oozie-workflow/dist/
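The egg build assumes a setuptools-based setup.py; a minimal sketch might look like this (the name and version are chosen to match the egg file above, the packaging details are assumptions):

```python
# Minimal sketch of python-project/setup.py
from setuptools import setup, find_packages

setup(
    name='project',
    version='1.0',
    packages=find_packages(),  # picks up the pythonproject package
)
```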
Run update_and_run.sh to upload oozie-workflow/ to HDFS and run the Oozie coordinator.
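The exact contents of update_and_run.sh depend on your cluster; a rough sketch of what such a script typically does (the HDFS path and Oozie URL are placeholders):

```bash
#!/usr/bin/env bash
# Hypothetical sketch: replace the workflow on HDFS, then start the coordinator
hdfs dfs -rm -r -f /user/somebody/oozie-workflow
hdfs dfs -put oozie-workflow /user/somebody/oozie-workflow
oozie job -oozie http://oozie-host:11000/oozie \
    -config oozie-workflow/job.properties -run
```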
oozie-workflow/: Coordinator to run a workflow.
- job.properties: Coordinator properties (start & end date, job tracker, etc.)
- coordinator.xml: Coordinator setting the data dependencies (see the sketch below).
- workflow.xml: Workflow specification.
- bin/: Scripts used in the workflow.
- dist/: Folder with the Python egg used in the workflow.

python-project/: Python project. Project code in pythonproject, egg build instructions in setup.py.
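To illustrate how the coordinator expresses those data dependencies, a stripped-down coordinator.xml could look like the following; the app name, dataset name, frequency, and URI template are assumptions of this sketch:

```xml
<coordinator-app name="daily-pipeline" frequency="${coord:days(1)}"
                 start="${startDate}" end="${endDate}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
    <datasets>
        <dataset name="input" frequency="${coord:days(1)}"
                 initial-instance="${startDate}" timezone="UTC">
            <uri-template>${nameNode}/data/input/${YEAR}/${MONTH}/${DAY}</uri-template>
        </dataset>
    </datasets>
    <input-events>
        <!-- Hold each daily run until that day's input data has landed -->
        <data-in name="todays_input" dataset="input">
            <instance>${coord:current(0)}</instance>
        </data-in>
    </input-events>
    <action>
        <workflow>
            <app-path>${workflowPath}</app-path>
            <configuration>
                <property>
                    <!-- the date handed to the PySpark pipeline -->
                    <name>date</name>
                    <value>${coord:nominalTime()}</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>
```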