- Hands-On Big Data Analytics with PySpark
- Rudy Lai, Bartłomiej Potaczek
Getting the data from the repository to Spark
We can follow these steps to download the dataset and load it in PySpark:
- Click on Data Folder.
- You will be redirected to a folder that has various files as follows:
You can see that there's kddcup.data.gz, and there is also 10% of that data available in kddcup.data_10_percent.gz. We will be working with the full dataset. To work with the full dataset, right-click on kddcup.data.gz, select Copy link address, and then go back to the PySpark console and import the data.
Let's take a look at how this works using the following steps:
- After launching PySpark, the first thing we need to do is import urllib, which is a library that allows us to interact with resources on the internet, as follows:
import urllib.request
- The next thing to do is use this request library to pull some resources from the internet, as shown in the following code:
f = urllib.request.urlretrieve("https://archive.ics.uci.edu/ml/machine-learning-databases/kddcup99-mld/kddcup.data.gz", "kddcup.data.gz")
This command will take some time to process. Once the file has been downloaded, we can see that Python has returned and the console is active.
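Before loading the file into Spark, it can be worth confirming that the download actually produced a valid archive. The following is a small sketch of such a check (a hypothetical helper, not part of the book's steps): a gzip file always starts with the magic bytes `0x1f 0x8b`, so reading the first two bytes is a cheap sanity test. The demo writes a tiny local gzip file so the snippet runs on its own:

```python
import gzip
import os

def looks_like_gzip(path):
    """Return True if the file exists and starts with the gzip magic bytes."""
    if not os.path.exists(path):
        return False
    with open(path, "rb") as f:
        return f.read(2) == b"\x1f\x8b"

# Demo on a small file written locally; in the real workflow you would
# call looks_like_gzip("kddcup.data.gz") on the downloaded archive.
with gzip.open("demo.gz", "wt") as f:
    f.write("hello\n")

print(looks_like_gzip("demo.gz"))    # True
print(looks_like_gzip("missing.gz")) # False
```

If the check fails, the download was likely interrupted and should be retried before moving on.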
- Next, load this file using SparkContext. In the PySpark shell, the SparkContext is already instantiated and exposed as the sc variable, as follows:
sc
The output is as demonstrated in the following snippet:
SparkContext
Spark UI
Version: v2.3.3
Master: local[*]
AppName: PySparkShell