- Hands-On Big Data Analytics with PySpark
- Rudy Lai, Bartłomiej Potaczek
Getting the data from the repository to Spark
We can follow these steps to download the dataset and load it in PySpark:
- Click on Data Folder.
- You will be redirected to a folder that has various files as follows:
You can see that there's kddcup.data.gz, and there is also 10% of that data available in kddcup.data_10_percent.gz. We will be working with the full dataset. To work with the full dataset, right-click on kddcup.data.gz, select Copy link address, and then go back to the PySpark console and import the data.
Let's take a look at how this works using the following steps:
- After launching PySpark, the first thing we need to do is import urllib, which is a library that allows us to interact with resources on the internet, as follows:
import urllib.request
- The next thing to do is use this request library to pull some resources from the internet, as shown in the following code:
f = urllib.request.urlretrieve("https://archive.ics.uci.edu/ml/machine-learning-databases/kddcup99-mld/kddcup.data.gz", "kddcup.data.gz")
This command will take some time to process. Once the file has been downloaded, we can see that Python has returned and the console is active.
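Before loading the file into Spark, it can be worth confirming that the download actually produced a valid archive. The following is a small sketch of such a check (a hypothetical helper, not part of the book's steps): a gzip file always starts with the magic bytes `0x1f 0x8b`, so reading the first two bytes is a cheap sanity test. The demo writes a tiny local gzip file so the snippet runs on its own:

```python
import gzip
import os

def looks_like_gzip(path):
    """Return True if the file exists and starts with the gzip magic bytes."""
    if not os.path.exists(path):
        return False
    with open(path, "rb") as f:
        return f.read(2) == b"\x1f\x8b"

# Demo on a small file written locally; in the real workflow you would
# call looks_like_gzip("kddcup.data.gz") on the downloaded archive.
with gzip.open("demo.gz", "wt") as f:
    f.write("hello\n")

print(looks_like_gzip("demo.gz"))    # True
print(looks_like_gzip("missing.gz")) # False
```

If the check fails, the download was likely interrupted and should be retried before moving on.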
- Next, load this file using SparkContext. In the PySpark shell, the SparkContext is already instantiated and exposed as the sc variable, as follows:
sc
The output is as demonstrated in the following snippet:
SparkContext
Spark UI
Version: v2.3.3
Master: local[*]
AppName: PySparkShell