Setting up Spark on Windows and PySpark

Complete the following steps to install PySpark on a Windows machine:

  1. Download Gnu on Windows (GOW) from https://github.com/bmatzelle/gow/releases/download/v0.8.0/Gow-0.8.0.exe.
  2. GOW lets you use Linux commands on Windows. Run the following command to list the Linux commands that GOW provides:
gow --list

This prints the list of Linux utilities that GOW provides, including curl, gzip, and tar, which are used later in this recipe.
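You can also confirm that an individual GOW utility resolves from the regular Windows command prompt. This is an optional sanity check and assumes GOW's which command was installed with the defaults:

which curl
which tar

If both commands print a path under the GOW installation directory, the download and extraction steps later in this recipe will work as written.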

  3. Download and install Anaconda. If you need help, you can go through the following tutorial: https://medium.com/@GalarnykMichael/install-python-on-windows-anaconda-c63c7c3d1444.
  4. Close the previous command line and open a new one so that the updated PATH takes effect.
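In the new window, you can verify that Anaconda is on your PATH. This is an optional check and assumes you accepted the installer's default option to register Anaconda on the PATH:

conda --version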
  5. Go to the Apache Spark website (https://spark.apache.org/).
  6. To download Spark, choose the following from the drop-down menus:
    • A recent Spark release (this recipe uses 2.1.0)
    • A package type (a pre-built-for-Hadoop 2.7 package matches the commands that follow)

The Apache Spark download page presents these choices as drop-down menus.

  7. Then, download Spark. Once it is downloaded, move the file to the folder where you want to unzip it.
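Alternatively, since GOW provides curl, you can fetch the archive directly from the Apache release archive instead of using the browser. The URL below assumes the 2.1.0/Hadoop 2.7 build used throughout this recipe:

curl -k -L -o spark-2.1.0-bin-hadoop2.7.tgz https://archive.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz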
  8. You can either unzip it manually or use the following commands:
gzip -d spark-2.1.0-bin-hadoop2.7.tgz
tar xvf spark-2.1.0-bin-hadoop2.7.tar
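Because GOW's tar is GNU tar, the two commands above can also be combined into a single step. This is a convenience, assuming the .tgz file is in the current directory:

tar xvzf spark-2.1.0-bin-hadoop2.7.tgz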
  9. Now, download winutils.exe into your spark-2.1.0-bin-hadoop2.7\bin folder by running the following command from inside that folder:
curl -k -L -o winutils.exe https://github.com/steveloughran/winutils/blob/master/hadoop-2.6.0/bin/winutils.exe?raw=true
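On many Windows setups, Spark also needs a writable \tmp\hive directory for its Hive support; if you later see permission errors when starting PySpark, creating the directory and opening its permissions with winutils usually resolves them. The path below assumes the C:\opt\spark location used in the environment variables later in this recipe:

mkdir \tmp\hive
C:\opt\spark\spark-2.1.0-bin-hadoop2.7\bin\winutils.exe chmod -R 777 \tmp\hive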
  10. Make sure you have Java installed on your machine. You can use the following command to see the Java version:
java -version

This prints the installed Java version; Spark 2.1.0 requires Java 7 or later.
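If the command is not found, install a JDK (Java 8 works well with Spark 2.1.0) and point JAVA_HOME at it. The path below is a hypothetical example; substitute your actual JDK installation directory:

rem Adjust the path to match your JDK installation
setx JAVA_HOME "C:\Program Files\Java\jdk1.8.0_144"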

  11. Check the Python version by using the following command:
python --version

This prints the installed Python version.

  12. Let's edit our environment variables so that we can open Spark in any directory, as follows:
setx SPARK_HOME C:\opt\spark\spark-2.1.0-bin-hadoop2.7
setx HADOOP_HOME C:\opt\spark\spark-2.1.0-bin-hadoop2.7
setx PYSPARK_DRIVER_PYTHON ipython
setx PYSPARK_DRIVER_PYTHON_OPTS notebook

Add C:\opt\spark\spark-2.1.0-bin-hadoop2.7\bin to your PATH, as shown in the sketch below.
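One way to append the folder to your PATH from the same command prompt is shown below. This is a sketch rather than the only approach: setx truncates values longer than 1,024 characters, and %PATH% here expands to the combined user and system path, so editing PATH through the System Properties dialog is safer if your PATH is long:

setx PATH "%PATH%;C:\opt\spark\spark-2.1.0-bin-hadoop2.7\bin"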

  13. Close the command line, open a new one, and type the following command:
pyspark --master local[2]
The PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS variables set earlier make the PySpark shell launch inside a Jupyter Notebook. The --master parameter sets the master URL; local[2] runs Spark locally with two worker threads.
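If you prefer a plain console-based PySpark shell for a single session, you can blank out the driver variables before launching. This is an optional variation; set, unlike setx, affects only the current command prompt:

set PYSPARK_DRIVER_PYTHON=
set PYSPARK_DRIVER_PYTHON_OPTS=
pyspark --master local[2]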
  14. The next thing to do is to run the pyspark script in the bin folder from the root of your Spark installation:
.\bin\pyspark

This launches the PySpark shell in a Jupyter Notebook in your browser.
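To confirm that everything works, run a small job in the notebook. This is a minimal sketch; it assumes the SparkContext is available as sc, which the PySpark shell provides by default:

# Create an RDD from a local Python range and run a simple action
rdd = sc.parallelize(range(100))

# count() triggers a Spark job across the two local worker threads
print(rdd.count())  # expected output: 100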