This project demonstrates an Oozie workflow with a PySpark action. It assumes that all the PySpark logic lives in a Python library that only needs a HiveContext and a date to run. The library is distributed to all the workers on the cluster, and a pipeline within it is kicked off daily, once its input data sources are available.
An Oozie workflow with a PySpark action is non-trivial, because:
1. Oozie doesn't have native support for a PySpark action.
2. YARN can kill your Oozie container when running PySpark in a shell.
3. Distributing eggs with PySpark can be a challenge due to problems with the PYTHON_EGG_CACHE (depending on cluster setup).
Issue 1 is solved by calling spark-submit in a shell action; issue 2 by setting the launcher configuration in the workflow definition; and issue 3 by setting the PYTHON_EGG_CACHE for the driver in the calling Python script and for the executors in the spark-submit.
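To make this concrete, here is a minimal sketch of such a shell action in workflow.xml. The action name, file names, and memory values are assumptions of this sketch, not necessarily what this repo uses; the oozie.launcher properties are what address issue 2 by enlarging the launcher container:

```xml
<action name="pyspark-action">
    <shell xmlns="uri:oozie:shell-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <!-- Issue 2: give the Oozie launcher container enough memory,
                 or YARN may kill it while the shell runs spark-submit -->
            <property>
                <name>oozie.launcher.mapreduce.map.memory.mb</name>
                <value>4096</value>
            </property>
            <property>
                <name>oozie.launcher.mapreduce.map.java.opts</name>
                <value>-Xmx3072m</value>
            </property>
        </configuration>
        <exec>submit.sh</exec>
        <argument>${date}</argument>
        <file>bin/submit.sh#submit.sh</file>
        <file>dist/project-1.0-py3.5.egg#project-1.0-py3.5.egg</file>
    </shell>
    <ok to="end"/>
    <error to="fail"/>
</action>
```

The submit script would then cover issues 1 and 3 roughly as follows; the script name, the calling Python script (run_pipeline.py), and the egg-cache path are likewise placeholders:

```bash
#!/usr/bin/env bash
# Issue 1: call spark-submit from the shell action instead of a
# (non-existent) native PySpark action.
# Issue 3: point the executors' PYTHON_EGG_CACHE at a writable directory.
# The driver-side equivalent is an os.environ['PYTHON_EGG_CACHE'] assignment
# at the top of the calling Python script, before the egg is imported.
spark-submit \
    --master yarn \
    --py-files project-1.0-py3.5.egg \
    --conf spark.executorEnv.PYTHON_EGG_CACHE=/tmp/python-eggs \
    run_pipeline.py "$1"
```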
Note: this repo will crash as-is (and this time it's not Oozie's fault!), because it is a stripped/anonymized version of a real workflow. Change references to namenodes, HDFS paths, Oozie URLs, /user/ directories, etc. to match your environment.
Create a Python egg by cd-ing into the Python project directory, building the egg, and copying it to dist/ in the Oozie workflow folder:
$ cd python-project/
$ python setup.py bdist_egg
$ cp dist/project-1.0-py3.5.egg ../oozie-workflow/dist/
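The egg build assumes a setuptools-based setup.py; a minimal sketch might look like this (the name and version are chosen to match the egg file above, the packaging details are assumptions):

```python
# Minimal sketch of python-project/setup.py
from setuptools import setup, find_packages

setup(
    name='project',
    version='1.0',
    packages=find_packages(),  # picks up the pythonproject package
)
```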
Run update_and_run.sh to upload oozie-workflow/ to HDFS and run the Oozie coordinator.
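The exact contents of update_and_run.sh depend on your cluster; a rough sketch of what such a script typically does (the HDFS path and Oozie URL are placeholders):

```bash
#!/usr/bin/env bash
# Hypothetical sketch: replace the workflow on HDFS, then start the coordinator
hdfs dfs -rm -r -f /user/somebody/oozie-workflow
hdfs dfs -put oozie-workflow /user/somebody/oozie-workflow
oozie job -oozie http://oozie-host:11000/oozie \
    -config oozie-workflow/job.properties -run
```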
oozie-workflow/: Coordinator to run a workflow.
- job.properties: Coordinator properties (start & end date, job tracker, etc.)
- coordinator.xml: Coordinator setting the data dependencies (see the sketch below).
- workflow.xml: Workflow specification.
- bin/: Scripts used in the workflow.
- dist/: Folder with the Python egg used in the workflow.

python-project/: Python project. Project code in pythonproject, egg build instructions in setup.py.
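To illustrate how the coordinator expresses those data dependencies, a stripped-down coordinator.xml could look like the following; the app name, dataset name, frequency, and URI template are assumptions of this sketch:

```xml
<coordinator-app name="daily-pipeline" frequency="${coord:days(1)}"
                 start="${startDate}" end="${endDate}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
    <datasets>
        <dataset name="input" frequency="${coord:days(1)}"
                 initial-instance="${startDate}" timezone="UTC">
            <uri-template>${nameNode}/data/input/${YEAR}/${MONTH}/${DAY}</uri-template>
        </dataset>
    </datasets>
    <input-events>
        <!-- Hold each daily run until that day's input data has landed -->
        <data-in name="todays_input" dataset="input">
            <instance>${coord:current(0)}</instance>
        </data-in>
    </input-events>
    <action>
        <workflow>
            <app-path>${workflowPath}</app-path>
            <configuration>
                <property>
                    <!-- the date handed to the PySpark pipeline -->
                    <name>date</name>
                    <value>${coord:nominalTime()}</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>
```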