
Cannot specify Python version when launching PySpark jobs #73

Open
mooperd opened this issue Oct 18, 2016 · 6 comments

@mooperd

mooperd commented Oct 18, 2016

Whilst trying to use python3 as the PySpark driver, I have found that PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON seem to be ignored when launching jobs with the dcos spark tool.

According to the Spark documentation, SPARK_HOME/conf/spark-env.sh can be used to set various variables when launching Spark jobs on Mesos: http://spark.apache.org/docs/latest/configuration.html#environment-variables

I have copied
~/.dcos/spark/dist/spark-2.0.0/conf/spark-env.sh.template
to
~/.dcos/spark/dist/spark-2.0.0/conf/spark-env.sh

and added the lines:

export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3

I have also tried the following in the shell before running jobs:

export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3

Putting these directly in the spark-submit shell script also does not work, which leads me to conclude that these environment variables are being stripped out somewhere. I don't see any errors anywhere.

I'm testing the Python version with:

import sys  # assumes `sc` is an existing SparkContext

version = sys.version
log4jLogger = sc._jvm.org.apache.log4j
LOGGER = log4jLogger.LogManager.getLogger(__name__)
LOGGER.info("pyspark script logger initialized")
LOGGER.info("Python Version: " + version)
@mgummelt
Contributor

~/.dcos/spark/dist contains your local distribution of Spark, but it has no effect on the driver or executors, which all run inside the Docker image in the cluster. You'd have to modify spark-env.sh in the Docker image.
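
For illustration, one way to do that is to build a custom image and point jobs at it. This is only a minimal sketch: the base image name/tag, the SPARK_HOME path, and the target registry name below are assumptions, not taken from this thread.

# Sketch: build a Spark image whose spark-env.sh selects python3
# (base image/tag, SPARK_HOME path and registry name are assumptions)
cat > Dockerfile <<'EOF'
FROM mesosphere/spark:2.0.0
RUN echo 'export PYSPARK_PYTHON=python3' >> "$SPARK_HOME/conf/spark-env.sh" && \
    echo 'export PYSPARK_DRIVER_PYTHON=python3' >> "$SPARK_HOME/conf/spark-env.sh"
EOF
docker build -t <your-registry>/spark-py3:2.0.0 .
docker push <your-registry>/spark-py3:2.0.0
# Then point jobs at the custom image, e.g. via spark.mesos.executor.docker.image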

@mooperd
Author

mooperd commented Oct 19, 2016

@mgummelt This seems like a somewhat backwards way to use Spark. Typically one should have control over these variables when starting jobs.

I think this would mean that Spark applications are not easy to port to DC/OS.

@mgummelt
Contributor

Can you give me an example of how you would set these outside of DC/OS?

When submitting in cluster mode, I'm not aware of any other system (YARN, Standalone) that forwards along those environment variables to the driver.
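
For comparison, in cluster mode these settings are typically passed as Spark configuration rather than inherited from the submitter's shell. A sketch (app.py is a placeholder; exact keys and behaviour depend on the Spark version and cluster manager):

# Executor environment (any cluster manager):
spark-submit --conf spark.executorEnv.PYSPARK_PYTHON=python3 app.py
# Driver environment in YARN cluster mode:
spark-submit --master yarn --deploy-mode cluster \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=python3 \
  --conf spark.executorEnv.PYSPARK_PYTHON=python3 app.py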

@mooperd
Author

mooperd commented Oct 19, 2016

It is common to switch your version of Python using the PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables, and I use this often. It's most common when testing between python2 and python3; however, in one specific case I have seen three different versions of Anaconda Python installed on a Hadoop cluster, each with different dependencies and custom modules set up.

The Spark documentation also says that these variables should be controllable when using spark-submit - http://spark.apache.org/docs/latest/configuration.html#environment-variables - but it's a bit confusing, as YARN seems to have about three hundred submission modes.
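
(As an aside not raised in this thread: newer Spark releases also expose these as submit-time properties, spark.pyspark.python and spark.pyspark.driver.python, which can be set with --conf; whether they reach the driver in a given cluster mode depends on the Spark version. A sketch:)

spark-submit \
  --conf spark.pyspark.python=python3 \
  --conf spark.pyspark.driver.python=python3 \
  test.py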

From the Cloudera blog comes the following passage:

In the best possible world, you have a good relationship with your local sysadmin and they are able and willing to set up a virtualenv or install the Anaconda distribution of Python on every node of your cluster, with your required dependencies. If you are a data scientist responsible for administering your own cluster, you may need to get creative about setting up your required Python environment on your cluster. If you have sysadmin or devops support for your cluster, use it!

http://blog.cloudera.com/blog/2015/09/how-to-prepare-your-apache-hadoop-cluster-for-pyspark-jobs/

Test on AWS EMR

test.py

import sys
from pyspark import SparkContext

sc = SparkContext()
version = sys.version
log4jLogger = sc._jvm.org.apache.log4j
LOGGER = log4jLogger.LogManager.getLogger(__name__)
LOGGER.info("pyspark script logger initialized")
LOGGER.info("Python Version: " + version)
sc.stop()
exit()

Run without VARS set

[hadoop@ip-10-141-1-236 ]$ spark-submit test.py
</snip>
16/10/19 20:47:16 INFO __main__: Python Version: 2.7.10 (default, Jul 20 2016, 20:53:27)
</snip>

Run with VARS set

[hadoop@ip-10-141-1-236 ]$ export PYSPARK_PYTHON=python3.4
[hadoop@ip-10-141-1-236 ]$ export PYSPARK_DRIVER_PYTHON=python3.4
[hadoop@ip-10-141-1-236 ]$ spark-submit test.py
</snip>
16/10/19 20:49:47 INFO __main__: Python Version: 3.4.3 (default, Jul 20 2016, 21:31:36) 
</snip>

@mooperd
Author

mooperd commented Oct 21, 2016

@mgummelt - could we reopen this issue?

@mgummelt mgummelt reopened this Oct 31, 2016
@jstremme

jstremme commented Dec 3, 2019

I came across this recently when using AWS EMR and was able to set up a Python 3.6.8 driver to match the version on my worker nodes with the following steps, after SSHing into the master node:

# Update package manager
sudo yum update

# Install Anaconda - you may need to close and reopen your shell after this
wget https://repo.continuum.io/archive/Anaconda3-2019.10-Linux-x86_64.sh
sh Anaconda3-2019.10-Linux-x86_64.sh

# Create virtual environment
conda create -n py368 python=3.6.8
source activate py368

# Install Python packages
pip install --user jupyter
pip install --user ipython
pip install --user ipykernel
pip install --user numpy
pip install --user pandas
pip install --user matplotlib
pip install --user scikit-learn

# Create notebook kernel
python -m ipykernel install --user --name py368 --display-name "Python 3.6.8"

# Pull repo
sudo yum install git
git clone https://github.com/jstremme/DATA512-Research.git

# PySpark configuration
export PYTHONPATH="/home/hadoop/.local/lib/python3.6/site-packages:$PYTHONPATH"
export PYSPARK_DRIVER_PYTHON=/home/hadoop/.local/bin/jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser'
export PYSPARK_PYTHON=/usr/bin/python3
echo $PYTHONPATH
echo $PYSPARK_DRIVER_PYTHON
echo $PYSPARK_DRIVER_PYTHON_OPTS
pyspark
