How to use Spark with Python

First of all, the Spark APIs for Python have to be downloaded in order to import their modules and define a Spark session. One possible way to do this is presented below.

Step 1: Download and install a JDK

Since PySpark relies on a Java Virtual Machine (the Python driver communicates with a JVM process through py4j, and the Spark engine itself runs in the JVM of each node of the cluster), a Java Development Kit, with its compiler and its APIs, must be available on the system. Head to the Oracle download page and download the Java Development Kit version 8. Version 9 (or later versions) might give problems, so we should avoid them for the time being.
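As a quick sanity check (not part of the official procedure), one can verify that the JDK is visible from the shell by running

$ java -version
$ javac -version

Both commands should report a 1.8.x version.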

Step 2: Use pip

In order to download the Spark libraries, it is sufficient to open a terminal and to type

$ pip install pyspark

This will also take care of installing the dependencies (e.g. py4j).
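To confirm that the package was installed correctly, and to see where it ended up (which will be useful in Step 3), one can for example run

$ pip show pyspark

which prints, among other things, the installed version and the Location of the package.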

Remark: if conda is installed, one can equivalently use its package manager with the command

$ conda install pyspark

Step 3: Configure the environment variables

As an intermediate step, the installation directories of Spark and Java have to be located.
By default, under Unix-based systems, Java and PySpark should respectively be found under "/usr/lib/jvm/*" and "/usr/lib/python*/dist-packages/pyspark". If Anaconda is present on the computer, then both its package manager and pip will install pyspark in one of its subdirectories.
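If the directories are not in the default locations, the following commands are one way to find them (a sketch for Linux; the exact paths vary from machine to machine):

$ python -c "import pyspark; print(pyspark.__file__)"
$ readlink -f $(which java)

The first command prints the path of the installed pyspark package (the directory containing it is a candidate for SPARK_HOME), while the second resolves the symlink of the java executable and gives a hint of where the JDK lives.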

Linux and OS X instructions

Run in a terminal

$ nano ~/.bashrc

and add the following lines to the file

export JAVA_HOME="/path/to/JDK/"          
export SPARK_HOME="/path/to/Spark/"  
export PATH=$JAVA_HOME/bin:$SPARK_HOME:$SPARK_HOME/bin:$PATH   
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH 

For the changes to take effect, one can just close and reopen the terminal or run the command

$ source ~/.bashrc

Windows instructions

Under Windows the procedure is slightly different:
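The same variables (JAVA_HOME, SPARK_HOME, PATH and PYTHONPATH) have to be defined through Control Panel > System > Advanced system settings > Environment Variables. As a rough command-line equivalent (a sketch; the paths are placeholders that must be adapted to the actual installation), one can use setx from a Command Prompt:

> setx JAVA_HOME "C:\path\to\JDK"
> setx SPARK_HOME "C:\path\to\Spark"
> setx PYTHONPATH "%SPARK_HOME%\python"
> setx PATH "%PATH%;%JAVA_HOME%\bin;%SPARK_HOME%\bin"

Note that setx stores the user variables permanently, but they become visible only in terminals opened afterwards; for PATH the graphical dialog is usually preferable, since setx truncates values longer than 1024 characters.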

Step 4: Verify the installation

If the installation ended properly, launching PySpark from the terminal

$ pyspark

will show something like

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.0
      /_/

Using Python version 3.6.4 (default, Jan 16 2018 18:10:19)
SparkSession available as 'spark'.
>>> 

Usage example

To use the Spark libraries in a Python script (*.py), one must define the Spark context by including the following instructions in the code

from pyspark import SparkContext, SparkConf

# Configuration: application name and master URL of the cluster
conf = SparkConf().setAppName("appName").setMaster("master")
# Create the context (or reuse one that is already running)
sc = SparkContext.getOrCreate(conf=conf)

where appName is a string specifying the name of the application shown in the cluster UI, and "master" must be replaced with "local[*]", if the program is run locally, or with the appropriate cluster manager specification (e.g., "yarn"), if the program is run on a cluster.
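As a concrete illustration, a minimal self-contained script could look like the following (a sketch: the file name MyProgram.py and the line-counting logic are purely illustrative). It counts the lines of a text file whose path is passed as the first program parameter.

import sys
from pyspark import SparkContext, SparkConf

if __name__ == "__main__":
    # Local execution on all available cores; replace with the proper
    # master specification when running on a cluster
    conf = SparkConf().setAppName("LineCount").setMaster("local[*]")
    sc = SparkContext.getOrCreate(conf=conf)

    # Load the file into an RDD and trigger the computation with an action
    lines = sc.textFile(sys.argv[1])
    print("Number of lines:", lines.count())

    sc.stop()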

The actual execution of a program MyProgram.py on some program-parameters is then performed with the command

$ python path-to-program/MyProgram.py program-parameters

Appendix: configuration of Jupyter

If Jupyter is already installed, it is possible to launch PySpark inside a notebook, with the Spark context created automatically, using the command

$ PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS=notebook pyspark

The linking can also be made permanent by adding to ~/.bashrc the lines

export PYSPARK_DRIVER_PYTHON="jupyter"  
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
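
Inside a notebook started this way (as in the interactive shell), the session spark and the context sc are already defined, so a quick sanity check in a cell might look like the following sketch:

# `spark` and `sc` are created automatically by the pyspark launcher
print(spark.version)                 # e.g. 2.3.0
rdd = sc.parallelize(range(10))
print(rdd.sum())                     # 45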