I have an IPython notebook containing some PySpark code that runs on a cluster. Currently we use Oozie to run these notebooks on Hadoop via HUE. The setup feels less than ideal and we were wondering if there is an alternative.
We first convert the .ipynb file into a .py file and move it to HDFS (a sketch of that step follows the script below). Along with this file we also create a .sh file that calls the Python file. Its contents are similar to:
#!/bin/bash
set -e

# Activate the PySpark virtualenv if it is present
[ -r /usr/local/virtualenv/pyspark/bin/activate ] &&
    source /usr/local/virtualenv/pyspark/bin/activate

# Submit the converted notebook to YARN in client mode
spark-submit --master yarn-client --<setting> <setting_val> <filename>.py
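For completeness, the conversion and upload step mentioned above looks roughly like this (a sketch; the HDFS paths are placeholders, and depending on the IPython/Jupyter version the converter is invoked as ipython nbconvert or jupyter nbconvert):

# Convert the notebook to a plain .py script
jupyter nbconvert --to script <filename>.ipynb

# Copy the generated script and the wrapper shell script to HDFS
hdfs dfs -put -f <filename>.py /user/<user>/jobs/
hdfs dfs -put -f <filename>.sh /user/<user>/jobs/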
Next we have Oozie point to this .sh file. This flow feels cumbersome, and Oozie doesn't give us much insight into what goes wrong when something fails. We do like that Oozie knows how to run tasks in parallel or serially depending on the configuration.
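For context, Oozie points at the script through a shell action in a workflow.xml stored in HDFS (in our case set up through HUE). Submitted from the command line, a run would look roughly like this (the Oozie host, port and properties file are placeholders):

oozie job -oozie http://<oozie_host>:11000/oozie -config job.properties -run

where job.properties sets oozie.wf.application.path to the HDFS directory containing the workflow.xml that wraps the .sh file.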
Is there a better, smoother way of scheduling these PySpark notebooks?