GetStarted_python
Before you begin, make sure you have exported or set the environment variables required by CaffeOnSpark, as described in the GetStarted guides (GetStarted_local_osx, GetStarted_local, or GetStarted_yarn).
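If you want a quick sanity check before proceeding, the following illustrative snippet (not part of CaffeOnSpark; the variable list is an assumption based on the commands used later in this guide) verifies that those variables are set:

# illustrative sanity check; the variable list below is an assumption
# based on the commands used later in this guide
import os
import sys

required = ['CAFFE_ON_SPARK', 'SPARK_HOME', 'MASTER_URL', 'TOTAL_CORES', 'LD_LIBRARY_PATH']
missing = [name for name in required if not os.environ.get(name)]
if missing:
    sys.exit('Missing environment variables: ' + ', '.join(missing))
print('All required environment variables are set.')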
You can either install Python as described in the "Fresh Python install" section below or use an existing installation on your machine. The following Python modules are required in your Python installation:
Cython>=0.19.2
numpy>=1.7.1
scipy>=0.13.2
scikit-image>=0.9.3
matplotlib>=1.3.1
ipython>=3.0.0
h5py>=2.2.0
leveldb>=0.191
networkx>=1.8.1
nose>=1.3.0
pandas>=0.12.0
python-dateutil>=1.4,<2
protobuf>=2.5.0
python-gflags>=2.0
pyyaml>=3.10
Pillow>=2.3.0
six>=1.1.0
In addition, the following modules are required:
ipython[notebook]
numpy
matplotlib
pandas
py4j
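To confirm that the modules above are importable from your installation, you can run a quick check along these lines (an illustrative sketch; adjust the module list to match your needs):

# illustrative import check for the dependencies listed above
import importlib

for name in ['Cython', 'numpy', 'scipy', 'skimage', 'matplotlib',
             'IPython', 'h5py', 'leveldb', 'networkx', 'nose',
             'pandas', 'dateutil', 'google.protobuf', 'gflags',
             'yaml', 'PIL', 'six', 'py4j']:
    try:
        module = importlib.import_module(name)
        print('%-16s %s' % (name, getattr(module, '__version__', 'ok')))
    except ImportError as e:
        print('%-16s MISSING (%s)' % (name, e))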
Once you have installed these modules, point IPYTHON_ROOT at your Python installation, create a Python.zip archive of it, and set the other environment variables as below:
export IPYTHON_ROOT=<your python installation>
pushd "${IPYTHON_ROOT}" >/dev/null
zip -r Python.zip *
popd >/dev/null
export PATH="${IPYTHON_ROOT}/bin:${PATH}"
export PYSPARK_PYTHON=${IPYTHON_ROOT}/bin/python
export PYTHONPATH=${CAFFE_ON_SPARK}/caffe-grid/target/caffeonsparkpythonapi.zip:${SPARK_HOME}/python/lib/pyspark.zip
export PATH=${SPARK_HOME}/bin:${IPYTHON_ROOT}/bin:$PATH
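A quick way to verify that the exports took effect (an illustrative check, run with the python you just exported):

# verify that the exports point where you expect
import os
print(os.environ.get('PYSPARK_PYTHON'))   # expect ${IPYTHON_ROOT}/bin/python
for entry in os.environ.get('PYTHONPATH', '').split(':'):
    print(entry)                          # expect caffeonsparkpythonapi.zip and pyspark.zip entries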
Fresh Python install
Skip this step if you pointed IPYTHON_ROOT to your own installation above.
export IPYTHON_ROOT=~/Python2.7.10 #Change this directory to install elsewhere.
curl -O https://www.python.org/ftp/python/2.7.10/Python-2.7.10.tgz
tar -xvf Python-2.7.10.tgz
rm Python-2.7.10.tgz
pushd Python-2.7.10 >/dev/null
./configure --prefix="${IPYTHON_ROOT}"
make
make install
popd >/dev/null
rm -rf Python-2.7.10
pushd "${IPYTHON_ROOT}" >/dev/null
curl -O https://bootstrap.pypa.io/get-pip.py
bin/python get-pip.py
rm get-pip.py
bin/pip install <All dependencies listed in the first section>
popd >/dev/null
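As a quick check that the fresh interpreter built correctly (illustrative):

# run with ${IPYTHON_ROOT}/bin/python to confirm the fresh build
import sys
print(sys.version)     # expect 2.7.10
print(sys.executable)  # expect ${IPYTHON_ROOT}/bin/python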
Run the MultiClassLogisticRegression example via spark-submit:
pushd ${CAFFE_ON_SPARK}/data/
unzip ${CAFFE_ON_SPARK}/caffe-grid/target/caffeonsparkpythonapi.zip
spark-submit --master ${MASTER_URL} --driver-library-path "${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar" \
--driver-class-path "${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar" \
--conf spark.cores.max=${TOTAL_CORES} \
--conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
--py-files ${CAFFE_ON_SPARK}/caffe-grid/target/caffeonsparkpythonapi.zip \
--files ${CAFFE_ON_SPARK}/data/caffe/_caffe.so,${CAFFE_ON_SPARK}/data/caffe/lenet_memory_solver.prototxt,${CAFFE_ON_SPARK}/data/caffe/lenet_memory_train_test.prototxt \
--jars "${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar" \
--conf spark.pythonargs="-conf lenet_memory_solver.prototxt -model file:///tmp/lenet.model -features accuracy,ip1,ip2 -label label -output file:///tmp/output -devices 1 -outputFormat json" \
examples/MultiClassLogisticRegression.py
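Once the job completes, the extracted features are written as JSON under file:///tmp/output (per the -output and -outputFormat arguments above). A minimal sketch to peek at a few records, assuming Spark's usual part-file naming:

# peek at the first few extracted-feature records
import glob
import itertools
import json

files = glob.glob('/tmp/output/part-*')
lines = itertools.chain.from_iterable(open(p) for p in files)
for line in itertools.islice(lines, 3):
    print(json.loads(line))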
To work interactively instead, launch pyspark with CaffeOnSpark on the classpath:
pushd ${CAFFE_ON_SPARK}/data/
unzip ${CAFFE_ON_SPARK}/caffe-grid/target/caffeonsparkpythonapi.zip
IPYTHON=1 pyspark --master ${MASTER_URL} --driver-library-path "${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar" \
--driver-class-path "${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar" \
--conf spark.cores.max=${TOTAL_CORES} \
--conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
--py-files ${CAFFE_ON_SPARK}/caffe-grid/target/caffeonsparkpythonapi.zip \
--files ${CAFFE_ON_SPARK}/data/caffe/_caffe.so,${CAFFE_ON_SPARK}/data/caffe/lenet_memory_solver.prototxt,${CAFFE_ON_SPARK}/data/caffe/lenet_memory_train_test.prototxt \
--jars "${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar"
You can also run a simple LogisticRegression at the resulting shell prompt, as below.
import os
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from com.yahoo.ml.caffe.RegisterContext import registerContext, registerSQLContext
from com.yahoo.ml.caffe.CaffeOnSpark import CaffeOnSpark
from com.yahoo.ml.caffe.Config import Config
from com.yahoo.ml.caffe.DataSource import DataSource
registerContext(sc)
registerSQLContext(sqlContext)
cos=CaffeOnSpark(sc,sqlContext)
cfg=Config(sc)
cfg.protoFile = os.path.expanduser('~/CaffeOnSpark/data/lenet_dataframe_solver.prototxt')  # '~' does not expand inside a plain Python string
cfg.modelPath = 'file:/tmp/lenet.model'
cfg.label = 'label'
cfg.features = ['ip1']  # row.ip1 below relies on this feature being extracted
cfg.outputPath = 'file:outputlenet'
cfg.devices = 1
cfg.outputFormat = 'json'
cfg.isTest = True
cfg.clusterSize = 1
dl_train_source = DataSource(sc).getSource(cfg,True)
#Train
cos.train(dl_train_source)
lr_raw_source = DataSource(sc).getSource(cfg,False)
#Extract features
extracted_df = cos.features(lr_raw_source)
# Do multiclass LogisticRegression
data = extracted_df.map(lambda row: LabeledPoint(row.label[0], Vectors.dense(row.ip1)))
lr = LogisticRegressionWithLBFGS.train(data, numClasses=10, iterations=10)
predictions = lr.predict(data.map(lambda pt : pt.features))
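To gauge the model, you can compare predictions against labels at the same shell prompt; this follows the standard MLlib pattern:

# training error: fraction of points whose predicted class differs from the label
labelsAndPreds = data.map(lambda pt: (pt.label, float(lr.predict(pt.features))))
trainErr = labelsAndPreds.filter(lambda lp: lp[0] != lp[1]).count() / float(data.count())
print('Training error = %g' % trainErr)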
This step is required only if you want to run the sample notebook shipped with the code under ${CAFFE_ON_SPARK}/data/examples/DLDemo.ipynb. Skip to the next section if you don't want to run the demo notebook.
rm -rf ${CAFFE_ON_SPARK}/data/mnist_train_dataframe
spark-submit --master ${MASTER_URL} \
--conf spark.cores.max=${TOTAL_CORES} \
--conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
--conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
--class com.yahoo.ml.caffe.tools.LMDB2DataFrame \
${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
-imageRoot file:${CAFFE_ON_SPARK}/data/mnist_train_lmdb \
-lmdb_partitions 10 \
-outputFormat parquet \
-output file:${CAFFE_ON_SPARK}/data/mnist_train_dataframe
rm -rf ${CAFFE_ON_SPARK}/data/mnist_test_dataframe
spark-submit --master ${MASTER_URL} \
--conf spark.cores.max=${TOTAL_CORES} \
--conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
--conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
--class com.yahoo.ml.caffe.tools.LMDB2DataFrame \
${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
-imageRoot file:${CAFFE_ON_SPARK}/data/mnist_test_lmdb \
-lmdb_partitions 10 \
-outputFormat parquet \
-output file:${CAFFE_ON_SPARK}/data/mnist_test_dataframe
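You can sanity-check the converted dataframes from the pyspark shell launched earlier; a short illustrative sketch (the exact schema depends on the converter):

# inspect one of the converted MNIST dataframes from the pyspark shell
import os
df = sqlContext.read.parquet('file:' + os.environ['CAFFE_ON_SPARK'] + '/data/mnist_train_dataframe')
df.printSchema()
print(df.count())  # the MNIST training set has 60000 images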
Make sure that your ${CAFFE_ON_SPARK}/data/caffe/lenet_dataframe_train_test.prototxt is updated to point to ${CAFFE_ON_SPARK}/data/mnist_train_dataframe for training and ${CAFFE_ON_SPARK}/data/mnist_test_dataframe for testing.
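If you want to double-check the prototxt without opening an editor, this illustrative snippet prints its source lines so you can confirm both paths:

# print the data-source lines from the train/test prototxt for a quick check
import os
proto = os.path.join(os.environ['CAFFE_ON_SPARK'], 'data/caffe/lenet_dataframe_train_test.prototxt')
with open(proto) as f:
    for number, line in enumerate(f, 1):
        if 'source' in line:
            print('%4d: %s' % (number, line.rstrip()))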
export IPYTHON_OPTS="notebook --no-browser --ip=`hostname`"
pushd ${CAFFE_ON_SPARK}/data/
unzip ${CAFFE_ON_SPARK}/caffe-grid/target/caffeonsparkpythonapi.zip
IPYTHON=1 pyspark --master ${MASTER_URL} --driver-library-path "${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar" \
--driver-class-path "${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar" \
--conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
--py-files ${CAFFE_ON_SPARK}/caffe-grid/target/caffeonsparkpythonapi.zip \
--conf spark.cores.max=${TOTAL_CORES} \
--files ${CAFFE_ON_SPARK}/data/caffe/_caffe.so,${CAFFE_ON_SPARK}/data/caffe/lenet_dataframe_solver.prototxt,${CAFFE_ON_SPARK}/data/caffe/lenet_dataframe_train_test.prototxt \
--jars "${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar"
When you run the above, the console prints a URL, which you should open in your browser. There, click examples/DLDemo.ipynb. When executing the notebook, replace the paths of the various files (such as lenet_memory_solver.prototxt and mnist_test_dataframe) with their full paths on your system.