Data understanding and preparation

To begin, we will load the dataset named water and define the structure of the str() function as follows:

    > data(water)

> str(water)
'data.frame': 43 obs. of 8 variables:
$ Year : int 1948 1949 1950 1951 1952 1953 1954
1955 1956 1957 ...

$ APMAM : num 9.13 5.28 4.2 4.6 7.15 9.7 5.02 6.7
10.5 9.1 ...

$ APSAB : num 3.58 4.82 3.77 4.46 4.99 5.65 1.45
7.44 5.85 6.13 ...

$ APSLAKE: num 3.91 5.2 3.67 3.93 4.88 4.91 1.77
6.51 3.38 4.08 ...

$ OPBPC : num 4.1 7.55 9.52 11.14 16.34 ...
$ OPRC : num 7.43 11.11 12.2 15.15 20.05 ...
$ OPSLAKE: num 6.47 10.26 11.35 11.13 22.81 ...
$ BSAAM : int 54235 67567 66161 68094 107080
67594 65356 67909 92715 70024 ...

Here we have eight features and one response variable, BSAAM. The observations start in 1943 and run for 43 consecutive years. Since for this exercise we are not concerned with what year the observations occurred in, it makes sense to create a new data frame excluding the year vector. This is quite easy to do. With one line of code, we can create the new data frame, and then verify that it works with the head() function:

    > socal.water <- water[ ,-1] #new dataframe with 
the deletion of
column 1


> head(socal.water)
APMAM APSAB APSLAKE OPBPC OPRC OPSLAKE BSAAM
1 9.13 3.58 3.91 4.10 7.43 6.47 54235
2 5.28 4.82 5.20 7.55 11.11 10.26 67567
3 4.20 3.77 3.67 9.52 12.20 11.35 66161
4 4.60 4.46 3.93 11.14 15.15 11.13 68094
5 7.15 4.99 4.88 16.34 20.05 22.81 107080
6 9.70 5.65 4.91 8.88 8.15 7.41 67594

With all the features being quantitative, it makes sense to look at the correlation statistics and then produce a matrix of scatterplots. The correlation coefficient or Pearson's r, is a measure of both the strength and direction of the linear relationship between two variables. The statistic will be a number between -1 and 1, where -1 is the total negative correlation and +1 is the total positive correlation. The calculation of the coefficient is the covariance of the two variables Pided by the product of their standard deviations. As previously discussed, if you square the correlation coefficient, you will end up with R-squared.

There are a number of ways to produce a matrix of correlation plots. Some prefer to produce heatmaps, but I am a big fan of what is produced with the corrplot package. It can produce a number of different variations including ellipse, circle, square, number, shade, color, and pie. I like the ellipse method, but feel free to experiment with the others. Let's load the corrplot package, create a correlation object using the base cor() function, and examine the following results:

    > library(corrplot)

> water.cor <- cor(socal.water)

> water.cor
APMAM APSAB APSLAKE OPBPC
APMAM 1.0000000 0.82768637 0.81607595 0.12238567
APSAB 0.8276864 1.00000000 0.90030474 0.03954211
APSLAKE 0.8160760 0.90030474 1.00000000 0.09344773
OPBPC 0.1223857 0.03954211 0.09344773 1.00000000
OPRC 0.1544155 0.10563959 0.10638359 0.86470733
OPSLAKE 0.1075421 0.02961175 0.10058669 0.94334741
BSAAM 0.2385695 0.18329499 0.24934094 0.88574778
OPRC OPSLAKE BSAAM
APMAM 0.1544155 0.10754212 0.2385695
APSAB 0.1056396 0.02961175 0.1832950
APSLAKE 0.1063836 0.10058669 0.2493409
OPBPC 0.8647073 0.94334741 0.8857478
OPRC 1.0000000 0.91914467 0.9196270
OPSLAKE 0.9191447 1.00000000 0.9384360
BSAAM 0.9196270 0.93843604 1.0000000

So, what does this tell us? First of all, the response variable is highly and positively correlated with the OP features with OPBPC as 0.8857, OPRC as 0.9196, and OPSLAKE as 0.9384. Also note that the AP features are highly correlated with each other and the OP features as well. The implication is that we may run into the issue of multi-collinearity. The correlation plot matrix provides a nice visual of the correlations as follows:

    > corrplot(water.cor, method = "ellipse")

The output of the preceding code snippet is as follows:

Another popular visual is a scatterplot matrix. This can be called with the pairs() function. It reinforces what we saw in the correlation plot in the previous output:

    > pairs(~ ., data = socal.water)

The output of the preceding code snippet is as follows: