
Cannot specify Python version when launching PySpark jobs #73

Open
mooperd opened this issue Oct 18, 2016 · 6 comments

@mooperd

mooperd commented Oct 18, 2016

Whilst trying to use python3 as the PySpark driver, I have found that PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON seem to be ignored when launching jobs with the dcos spark tool.

According to the Spark documentation, SPARK_HOME/conf/spark-env.sh can be used to set various variables when launching Spark jobs on Mesos: http://spark.apache.org/docs/latest/configuration.html#environment-variables

I have copied
~/.dcos/spark/dist/spark-2.0.0/conf/spark-env.sh.template
to
~/.dcos/spark/dist/spark-2.0.0/conf/spark-env.sh

and added the lines:

export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3

I have also tried the following in the shell before running jobs:

export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3

Putting these directly in the spark-submit shell script also does not work, which leads me to conclude that these environment variables are being stripped out somewhere. I don't see any errors anywhere.

I'm testing the Python version with:

import sys  # assumes `sc` is an existing SparkContext

version = sys.version
log4jLogger = sc._jvm.org.apache.log4j
LOGGER = log4jLogger.LogManager.getLogger(__name__)
LOGGER.info("pyspark script logger initialized")
LOGGER.info("Python Version: " + version)
@mgummelt
Contributor

~/.dcos/spark/dist contains your local distribution of Spark, but it has no effect on the driver or executors, which all run inside the Docker image in the cluster. You'd have to modify spark-env.sh in the Docker image.
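
For illustration, one way to do that is to build a custom image and point jobs at it. This is only a minimal sketch: the base image name/tag, the SPARK_HOME path, and the target registry name below are assumptions, not taken from this thread.

# Sketch: build a Spark image whose spark-env.sh selects python3
# (base image/tag, SPARK_HOME path and registry name are assumptions)
cat > Dockerfile <<'EOF'
FROM mesosphere/spark:2.0.0
RUN echo 'export PYSPARK_PYTHON=python3' >> "$SPARK_HOME/conf/spark-env.sh" && \
    echo 'export PYSPARK_DRIVER_PYTHON=python3' >> "$SPARK_HOME/conf/spark-env.sh"
EOF
docker build -t <your-registry>/spark-py3:2.0.0 .
docker push <your-registry>/spark-py3:2.0.0
# Then point jobs at the custom image, e.g. via spark.mesos.executor.docker.image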

@mooperd
Author

mooperd commented Oct 19, 2016

@mgummelt This seems like a somewhat backwards way to use Spark. Typically one should have control over these variables when starting jobs.

I think this would mean that Spark applications are not easy to port to DC/OS.

@mgummelt
Contributor

Can you give me an example of how you would set these outside of DC/OS?

When submitting in cluster mode, I'm not aware of any other system (YARN, Standalone) that forwards along those environment variables to the driver.
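
For comparison, in cluster mode these settings are typically passed as Spark configuration rather than inherited from the submitter's shell. A sketch (app.py is a placeholder; exact keys and behaviour depend on the Spark version and cluster manager):

# Executor environment (any cluster manager):
spark-submit --conf spark.executorEnv.PYSPARK_PYTHON=python3 app.py
# Driver environment in YARN cluster mode:
spark-submit --master yarn --deploy-mode cluster \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=python3 \
  --conf spark.executorEnv.PYSPARK_PYTHON=python3 app.py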

@mooperd
Author

mooperd commented Oct 19, 2016

It is common to switch your version of Python using the PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables, and I use this often. It's most common when testing between python2 and python3; however, in one specific case I have seen three different versions of Anaconda Python installed on a Hadoop cluster, each with different dependencies and custom modules set up.

The Spark documentation also says that these variables should be controllable when using spark-submit - http://spark.apache.org/docs/latest/configuration.html#environment-variables - but it's a bit confusing, as YARN seems to have about three hundred submission modes.
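
(As an aside not raised in this thread: newer Spark releases also expose these as submit-time properties, spark.pyspark.python and spark.pyspark.driver.python, which can be set with --conf; whether they reach the driver in a given cluster mode depends on the Spark version. A sketch:)

spark-submit \
  --conf spark.pyspark.python=python3 \
  --conf spark.pyspark.driver.python=python3 \
  test.py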

From the Cloudera blog comes the following passage:

In the best possible world, you have a good relationship with your local sysadmin and they are able and willing to set up a virtualenv or install the Anaconda distribution of Python on every node of your cluster, with your required dependencies. If you are a data scientist responsible for administering your own cluster, you may need to get creative about setting up your required Python environment on your cluster. If you have sysadmin or devops support for your cluster, use it!

http://blog.cloudera.com/blog/2015/09/how-to-prepare-your-apache-hadoop-cluster-for-pyspark-jobs/

Test on AWS EMR

test.py

import sys
from pyspark import SparkContext

sc = SparkContext()
version = sys.version
log4jLogger = sc._jvm.org.apache.log4j
LOGGER = log4jLogger.LogManager.getLogger(__name__)
LOGGER.info("pyspark script logger initialized")
LOGGER.info("Python Version: " + version)
sc.stop()
exit()

Run without VARS set

[hadoop@ip-10-141-1-236 ]$ spark-submit test.py
</snip>
16/10/19 20:47:16 INFO __main__: Python Version: 2.7.10 (default, Jul 20 2016, 20:53:27)
</snip>

Run with VARS set

[hadoop@ip-10-141-1-236 ]$ export PYSPARK_PYTHON=python3.4
[hadoop@ip-10-141-1-236 ]$ export PYSPARK_DRIVER_PYTHON=python3.4
[hadoop@ip-10-141-1-236 ]$ spark-submit test.py
</snip>
16/10/19 20:49:47 INFO __main__: Python Version: 3.4.3 (default, Jul 20 2016, 21:31:36) 
</snip>

@mooperd
Author

mooperd commented Oct 21, 2016

@mgummelt - could we reopen this issue?

@mgummelt mgummelt reopened this Oct 31, 2016
@jstremme

jstremme commented Dec 3, 2019

I came across this recently when using AWS EMR and was able to set up a Python 3.6.8 driver to match the version on my worker nodes with the following steps, after SSHing into the master node:

# Update package manager
sudo yum update

# Install Anaconda - you may need to close and reopen your shell after this
wget https://repo.continuum.io/archive/Anaconda3-2019.10-Linux-x86_64.sh
sh Anaconda3-2019.10-Linux-x86_64.sh

# Create virtual environment
conda create -n py368 python=3.6.8
source activate py368

# Install Python packages
pip install --user jupyter
pip install --user ipython
pip install --user ipykernel
pip install --user numpy
pip install --user pandas
pip install --user matplotlib
pip install --user scikit-learn

# Create notebook kernel
python -m ipykernel install --user --name py368 --display-name "Python 3.6.8"

# Pull repo
sudo yum install git
git clone https://github.com/jstremme/DATA512-Research.git

# PySpark configuration
export PYTHONPATH="/home/hadoop/.local/lib/python3.6/site-packages:$PYTHONPATH"
export PYSPARK_DRIVER_PYTHON=/home/hadoop/.local/bin/jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser'
export PYSPARK_PYTHON=/usr/bin/python3
echo $PYTHONPATH
echo $PYSPARK_DRIVER_PYTHON
echo $PYSPARK_DRIVER_PYTHON_OPTS
pyspark
