Cannot specify Python version when launching PySpark jobs #73
@mgummelt This seems like a somewhat backwards way to use Spark. Typically one should have control over these variables when starting jobs. I think this would mean that Spark applications are not easy to port to DC/OS.
Can you give me an example of how you would set these outside of DC/OS? When submitting in cluster mode, I'm not aware of any other system (YARN, Standalone) that forwards along those environment variables to the driver.
It is common to switch your version of Python using the PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables, and I use this often. It's most common when testing between Python 2 and Python 3, but in one specific case I have seen three different versions of Anaconda Python installed on a Hadoop cluster, with different dependencies and custom modules set up. The Spark documentation also says that these variables should be controllable when using spark-submit - http://spark.apache.org/docs/latest/configuration.html#environment-variables - but it's a bit confusing, as YARN seems to have about three hundred submission modes. Cloudera also covers this:
http://blog.cloudera.com/blog/2015/09/how-to-prepare-your-apache-hadoop-cluster-for-pyspark-jobs/

Test on AWS EMR with test.py:
Run without VARS set
Run with VARS set
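Roughly, the two runs compare which interpreter spark-submit picks up with and without the variables exported. A sketch, where the python3 path and the test.py name are assumptions:

```sh
# Run without VARS set: spark-submit falls back to the system default python
spark-submit test.py

# Run with VARS set: driver and executors should both use python3
# (the /usr/bin/python3 path is an assumption; adjust to your nodes)
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
spark-submit test.py
```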
@mgummelt - could we reopen this issue?
I came across this recently when using AWS EMR and was able to get set up with a Python 3.6.8 driver to match the version of my worker nodes, using the following steps after SSHing into the master node:

```sh
# Update package manager
sudo yum update

# Install Anaconda - you may need to close and reopen your shell after this
wget https://repo.continuum.io/archive/Anaconda3-2019.10-Linux-x86_64.sh

# Create virtual environment
conda create -n py368 python=3.6.8

# Install Python packages
pip install --user jupyter

# Create notebook kernel
python -m ipykernel install --user --name py368 --display-name "Python 3.6.8"

# Pull repo
sudo yum install git

# PYSPARK Configuration
export PYTHONPATH="/home/hadoop/.local/lib/python3.6/site-packages:$PYTHONPATH"
```
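The final step is pointing the PySpark variables at the new environment; the conda path below is an assumption based on a default Anaconda install under the hadoop user's home directory:

```sh
# Assumed location of the py368 environment created above;
# adjust if Anaconda was installed somewhere else.
export PYSPARK_PYTHON=/home/hadoop/anaconda3/envs/py368/bin/python
export PYSPARK_DRIVER_PYTHON=/home/hadoop/anaconda3/envs/py368/bin/python
```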
Whilst trying to use python3 as the PySpark driver I have found that PYSPARK_PYTHON & PYSPARK_DRIVER_PYTHON seem to be ignored when launching jobs using the dcos spark tool.
According to the Spark documentation, SPARK_HOME/conf/spark-env.sh can be used to set various environment variables when launching Spark jobs on Mesos: http://spark.apache.org/docs/latest/configuration.html#environment-variables
I have copied
~/.dcos/spark/dist/spark-2.0.0/conf/spark-env.sh.template
to
~/.dcos/spark/dist/spark-2.0.0/conf/spark-env.sh
and added the lines:
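The additions were along these lines (the python3 location is an assumption about what is available on the agents):

```sh
# Added to spark-env.sh so both driver and executors use python3
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3
```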
I have also tried the following in the shell before running jobs:
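A sketch of that attempt; the exports mirror the spark-env.sh lines above, and the script URL passed to the CLI is a placeholder:

```sh
# Exported in the shell session before submitting
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3

# Submission through the DC/OS Spark CLI; the URL is a placeholder for
# wherever the job script is hosted.
dcos spark run --submit-args="https://example.com/jobs/test.py"
```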
Also, putting these directly in the spark-submit shell script does not work, which brings me to the conclusion that these environment variables are being stripped out somewhere. I don't see any errors anywhere.
I'm testing the python version with:
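A minimal version check of this sort is enough to see which interpreter the job actually runs under; the script name and its hosting URL are placeholders:

```sh
# version_check.py -- prints the interpreter used by the driver and executors
cat > version_check.py <<'EOF'
import sys
from pyspark import SparkContext

sc = SparkContext(appName="python-version-check")
print("driver:   %s" % sys.version)
print("executor: %s" % sc.parallelize([0]).map(lambda _: sys.version).first())
sc.stop()
EOF

# Submit through the DC/OS Spark CLI; the URL is a placeholder for wherever
# the script is hosted so the cluster can fetch it.
dcos spark run --submit-args="https://example.com/jobs/version_check.py"
```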