First of all, one has to download the Spark APIs for Python in order to import their modules and define a Spark session. A possible way to do this is presented in the following.
Since PySpark translates the Python instructions into Java bytecode that is executed in the JVM of each node of the cluster, a Java Runtime Environment, together with its compiler and APIs, must be available throughout the system. Head to the Oracle download page and download the Java Development Kit version 8. Version 9 (or later versions) might give problems, so they should be avoided for the time being.
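Once the JDK is installed, its presence and version can be checked from a terminal (the exact output depends on the vendor and build):
$ java -version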
In order to download the Spark libraries, it is sufficient to open a terminal and to type
$ pip install pyspark
This will also take care of installing the dependencies (e.g. py4j).
Remark: if conda is installed, one can equivalently use its package manager, writing the command
$ conda install pyspark
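In either case, a quick way to check that the package is visible to the Python interpreter is to print its version (the number shown depends on the installed release):
$ python -c "import pyspark; print(pyspark.__version__)"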
Next, the environment variables must be set so that the system can locate the JDK and the Spark installation. Open the ~/.bashrc file with a text editor, for instance
$ nano ~/.bashrc
and add the following lines to the file
export JAVA_HOME="/path/to/JDK/"
export SPARK_HOME="/path/to/Spark/"
export PATH=$JAVA_HOME/bin:$SPARK_HOME:$SPARK_HOME/bin:$PATH
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH
In order to make the changes take effect, one can close and reopen the terminal, or run the command
$ source ~/.bashrc
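One can then check that the variables are set as expected, for instance with
$ echo $JAVA_HOME
$ echo $SPARK_HOME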
Under Windows, the procedure for setting the environment variables is slightly different.
If the installation completed successfully, calling PySpark from the terminal with
$ pyspark
will show something like
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.0
      /_/

Using Python version 3.6.4 (default, Jan 16 2018 18:10:19)
SparkSession available as 'spark'.
>>>
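In the interactive shell, the Spark session is available as spark and the Spark context as sc, so a quick sanity check could be, for example,
>>> sc.parallelize(range(100)).sum()
4950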
To be able to use the Spark libraries in a Python script (*.py), one must define the Spark context by including the following instructions in the code
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("appName").setMaster("master")
sc = SparkContext.getOrCreate(conf=conf)
where appName is a string that specifies the name of the application shown in the cluster execution, and "master" must be replaced with "local[*]" if the program is run locally, or with the appropriate cluster manager specification (e.g., "yarn") if the program is run on a cluster.
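As a minimal, self-contained sketch of a local run (the application name and data are purely illustrative):
from pyspark import SparkContext, SparkConf

# illustrative configuration for a local run using all available cores
conf = SparkConf().setAppName("simpleExample").setMaster("local[*]")
sc = SparkContext.getOrCreate(conf=conf)

data = sc.parallelize([1, 2, 3, 4, 5])   # distribute a small list as an RDD
print(data.count())                      # prints 5
sc.stop()                                # release the resources held by the context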
The actual execution of a program MyProgram.py with some program-parameters is performed with the command
$ python path-to-program/MyProgram.py program-parameters
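For instance, a hypothetical MyProgram.py taking the path of a text file as its only parameter could be sketched as follows (the structure is illustrative, not a prescribed template):
import sys
from pyspark import SparkContext, SparkConf

# the first command-line parameter is taken as the input file path (illustrative)
input_path = sys.argv[1]

conf = SparkConf().setAppName("MyProgram").setMaster("local[*]")
sc = SparkContext.getOrCreate(conf=conf)

lines = sc.textFile(input_path)   # read the file as an RDD of lines
print(lines.count())              # print the number of lines
sc.stop()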
If Jupyter is already installed, it is possible to launch PySpark inside a notebook, with the Spark context created automatically, using the command
$ PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS=notebook pyspark
This setting can also be made permanent by adding to the ~/.bashrc file the lines
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
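With this configuration, running pyspark opens a notebook in which the variables spark and sc should already be defined, so they can be used directly in a cell; a minimal illustrative cell is
df = spark.range(10)   # small DataFrame with a single 'id' column, values 0..9
df.show()              # display the rows in the notebook output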