How to install PySpark on Win10 in 2020

Erjan G.
Jun 15, 2020


This is a short how-to to get Apache Spark up and running on Win10.

  1. Install the Java JDK (not the JRE) from here: https://www.oracle.com/java/technologies/javase-jdk14-downloads.html. For the installation path, choose a folder that **has no spaces** in it, or it will give you errors afterwards: not something like “C:\Program Files (x86)\Java” but something like “C:\Java”.
  2. Install Anaconda: https://www.anaconda.com/products/individual

Anaconda will install Python and other packages; we only need Python.

Check Java is there: go to the cmd prompt (Win+R, then type “cmd”).

Type “java -version”; this should give something like:

```
java version "1.8.0_92"
Java(TM) SE Runtime Environment (build 1.8.0_92-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.92-b14, mixed mode)
```

Check Python is there: go to cmd and type “python --version”; this should give you:

```
Python 3.7.6
```

Now it’s time to go to the Spark downloads page: https://spark.apache.org/downloads.html

Here choose your download options: pick a Spark release and the package type pre-built for Hadoop (the examples below use spark-2.4.0-bin-hadoop2.7, i.e. Spark 2.4.0 pre-built for Hadoop 2.7).

There is no installer, so you need to create a folder where Spark will live on your computer.

Create a folder (e.g. on the Desktop), name it “spark” and extract the downloaded archive there.

The full path to Apache Spark on your machine will look something like:

C:\Users\admin\Desktop\Spark\spark-2.4.0-bin-hadoop2.7
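
If you like doing this step from the command line, here is a rough sketch; the folder, file name and version number are just the examples from this guide, so adjust them to whatever you actually downloaded:

```
REM The paths and version numbers below are examples -- use your own.
mkdir C:\Users\admin\Desktop\Spark
cd /d C:\Users\admin\Desktop\Spark

REM Recent Win10 builds ship a tar.exe that can unpack the .tgz;
REM otherwise extract the archive with 7-Zip or a similar tool.
tar -xzf %USERPROFILE%\Downloads\spark-2.4.0-bin-hadoop2.7.tgz

REM Quick check that the extraction worked:
dir spark-2.4.0-bin-hadoop2.7\bin
```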

Now here is the most important thing: setting your JAVA_HOME environment variable:

  1. go to Windows Settings
  2. click on “System”
  3. click on “About”

In the upper corner you will see “Related settings”.

This will take you to “advanced system settings”

Click on “Environment Variables”.

Here create a “JAVA_HOME” environment variable and give it the path to the folder where Java is installed.

Don’t give the path to the bin folder! The path should go only up to the JDK folder.

Also create a SPARK_HOME variable, pointing to the folder where Spark is stored on your computer.
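
If you prefer the command line, the same two variables can be set with setx; this is only a sketch, and both folder names are examples (note there is no \bin at the end of JAVA_HOME):

```
REM setx stores the variable permanently for your user account.
REM Open a NEW cmd window afterwards for the change to become visible.
REM Both paths are examples -- point them at your own JDK and Spark folders.
setx JAVA_HOME "C:\Java\jdk1.8.0_92"
setx SPARK_HOME "C:\Users\admin\Desktop\Spark\spark-2.4.0-bin-hadoop2.7"
```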

In addition you need winutils.exe, which you can download from a GitHub repo (a commonly used one is https://github.com/steveloughran/winutils).

Choose according to the version number in your Spark directory: if it ends with \spark-2.4.6-bin-hadoop2.6, go grab ‘hadoop-2.6.0\bin’.

Create a hadoop\bin folder inside your Spark folder and put the winutils.exe file in there.

Create an environment variable “HADOOP_HOME” pointing to that hadoop folder (not to its bin subfolder).
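
Again, a cmd sketch of the same thing; the path is the example used throughout this guide:

```
REM winutils.exe should end up at %HADOOP_HOME%\bin\winutils.exe
setx HADOOP_HOME "C:\Users\admin\Desktop\Spark\spark-2.4.0-bin-hadoop2.7\hadoop"
```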

Now you can try running Apache Spark and see if it works (the same commands are sketched in cmd form after the list):

  1. go to the Anaconda cmd prompt
  2. navigate to the Spark folder
  3. type “bin\pyspark”
  4. this should print the “Apache Spark” ASCII logo and drop you into a Python prompt
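
Typed out as cmd commands, steps 2 and 3 look roughly like this (run them in a fresh Anaconda prompt so the environment variables set above are picked up):

```
REM Run from an Anaconda prompt so that python is on PATH.
cd /d %SPARK_HOME%
bin\pyspark
```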

Now you are good to go!

The most important thing is to install Java in a path without spaces or quotes.

Don’t include “bin” in the JAVA_HOME env var.
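
A quick way to double-check all three variables from cmd (this just prints whatever they currently hold):

```
REM JAVA_HOME should contain no spaces and should NOT end in \bin.
echo %JAVA_HOME%
echo %SPARK_HOME%
echo %HADOOP_HOME%
```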
